Top 20 Hive Interview Questions and Answers
Updated 14 Nov 2024
Q1. What are internal and external tables in hive?
Internal and external tables in Hive are two types of tables used to store data in Hive.
Internal tables store data in a default location managed by Hive
External tables store data in a location specified by the user
Internal tables are deleted when the table is dropped, while external tables retain data in the specified location
Example: CREATE TABLE internal_table (col1 INT, col2 STRING) STORED AS ORC;
Example: CREATE EXTERNAL TABLE external_table (col1 INT, col2 STRING) LOCATIO...read more
Q2. How do you eliminate null values in HIVE
Null values in HIVE can be eliminated using the COALESCE function.
Use the COALESCE function to replace null values with a default value.
Syntax: COALESCE(column_name, default_value)
Example: SELECT COALESCE(name, 'Unknown') FROM table_name;
Q3. Difference between Hive and Sql
Hive is a data warehousing tool for Hadoop while SQL is a language used to manage relational databases.
Hive is used for big data processing while SQL is used for relational databases.
Hive uses Hadoop Distributed File System (HDFS) while SQL uses traditional storage systems.
Hive queries are written in Hive Query Language (HQL) while SQL queries are written in Structured Query Language.
Hive is optimized for batch processing while SQL is optimized for transaction processing.
Hive...read more
Q4. What is partition in hive?
Partition in Hive is a way to organize data in a table into multiple directories based on the values of one or more columns.
Partitions help in improving query performance by allowing Hive to only read the relevant data directories.
Partitions are defined when creating a table in Hive using the PARTITIONED BY clause.
Example: CREATE TABLE table_name (column1 INT, column2 STRING) PARTITIONED BY (column3 STRING);
Q5. What are the analytical function does HIVE support
Hive supports various analytical functions for data processing and analysis.
Hive supports aggregation functions like SUM, COUNT, AVG, MIN, MAX, etc.
It also supports window functions like ROW_NUMBER, RANK, LAG, LEAD, etc.
Hive provides statistical functions such as CORR, COVAR_POP, STDDEV, etc.
It supports date and time functions like YEAR, MONTH, DAY, HOUR, etc.
Hive also offers string manipulation functions like CONCAT, SUBSTRING, LENGTH, etc.
Q6. What are ACID properties in HIVE table
ACID properties ensure data consistency and reliability in HIVE tables.
ACID stands for Atomicity, Consistency, Isolation, and Durability.
Atomicity ensures that a transaction is treated as a single unit of work.
Consistency ensures that the data remains in a valid state before and after the transaction.
Isolation ensures that concurrent transactions do not interfere with each other.
Durability ensures that once a transaction is committed, it is permanent and can survive system fa...read more
Q7. How you do data validation in hive
Data validation in Hive involves using built-in functions and custom scripts to ensure data accuracy and consistency.
Use built-in functions like IS NULL, IS NOT NULL, and COALESCE to check for missing or null values
Use regular expressions and pattern matching to validate data formats
Write custom scripts to perform more complex data validation tasks
Perform data profiling to identify potential data quality issues
Use data quality tools like Apache Nifi or Talend to automate data...read more
Q8. How to troubleshoot hive slowness
To troubleshoot hive slowness, check for resource contention, optimize queries, and monitor system performance.
Check for resource contention such as CPU, memory, and disk usage
Optimize queries by reducing data scanned and avoiding unnecessary joins
Monitor system performance using tools like Ganglia or Ambari
Consider partitioning tables to improve query performance
Use compression to reduce disk I/O and network traffic
Hive Jobs
Q9. Add data into a partitioned hive table
To add data into a partitioned hive table, you can use the INSERT INTO statement with the PARTITION clause.
Use INSERT INTO statement to add data into the table.
Specify the partition column values using the PARTITION clause.
Example: INSERT INTO table_name PARTITION (partition_column=value) VALUES (data);
Q10. explain details of hive meta store
Hive meta store stores metadata about Hive tables, partitions, columns, and storage location.
Hive meta store is a central repository that stores metadata information about Hive tables, partitions, columns, and storage location.
It stores this metadata in a relational database like MySQL, Derby, or PostgreSQL.
The metadata includes information such as table names, column names, data types, file formats, and storage location.
Hive uses the meta store to retrieve metadata informati...read more
Q11. Hive types of tables and difference between them
Hive has two types of tables - Managed and External. Managed tables are managed by Hive, while External tables are managed outside of Hive.
Managed tables are created using 'CREATE TABLE' command and data is stored in Hive's warehouse directory
External tables are created using 'CREATE EXTERNAL TABLE' command and data is stored outside of Hive's warehouse directory
Managed tables are deleted when the table is dropped, while External tables are not
Managed tables have full control...read more
Q12. Hive vs external table?
Hive tables store data in HDFS while external tables reference data stored outside HDFS.
Hive tables store data in HDFS, while external tables reference data stored outside HDFS.
External tables are useful when data needs to be shared across different Hive instances or when data is already stored in a different location.
Hive tables are managed by Hive and are deleted when the table is dropped, while external tables are not deleted when the table is dropped.
Example: Creating a H...read more
Q13. Hive External vs managed
Hive External vs managed
Hive External tables store data outside of the Hive warehouse directory
Managed tables store data in the Hive warehouse directory
External tables can be used to access data from different storage systems
Managed tables are easier to manage as Hive takes care of data storage and metadata
External tables require manual management of data and metadata
Q14. Hive table loads on incremental process
Hive table loads can be done incrementally by using partitioning or by using timestamp columns.
Partition the table based on a column like date or timestamp to load data incrementally
Use dynamic partition inserts to add new data to specific partitions
Utilize timestamp columns to filter and load only new data
Consider using tools like Apache NiFi or Apache Sqoop for efficient incremental data loading
Q15. Duplicate Records elimination in hive table
Use DISTINCT keyword in SELECT query to eliminate duplicate records in a Hive table
Use SELECT DISTINCT * FROM table_name to retrieve unique records
Consider using GROUP BY clause with appropriate columns to eliminate duplicates
Utilize the ROW_NUMBER() window function to assign a unique row number to each record and filter out duplicates
Q16. What are external internal tables in hive?
External and internal tables in Hive are two types of tables used to store data in Hive.
External tables store data outside of Hive's warehouse directory, while internal tables store data within the warehouse directory
External tables do not delete data when the table is dropped, while internal tables delete data
External tables are useful for sharing data between different Hive instances or with external systems
Internal tables are managed by Hive, while external tables are mana...read more
Q17. What is hive in big data
Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
Hive uses a SQL-like language called HiveQL to query and manage large datasets stored in Hadoop
It allows users to write complex queries to analyze and process data
Hive organizes data into tables, partitions, and buckets for efficient querying
It is commonly used for data warehousing, data analysis, and data processing tasks
Q18. What are the tables in hive
Hive tables are used to store structured data in Hive, similar to tables in a traditional database.
Hive tables are created using the CREATE TABLE statement.
Tables can be partitioned based on one or more columns.
External tables in Hive store data outside of the default location in HDFS.
Managed tables store data in the default location in HDFS.
Tables can be queried using SQL-like syntax in HiveQL.
Q19. Architecture of hive,types of hive table, file formats in hive, dynamic partition in hive
Hive architecture, table types, file formats, and dynamic partitioning.
Hive architecture consists of metastore, driver, compiler, and execution engine.
Hive tables can be of two types: managed tables and external tables.
File formats supported by Hive include text, sequence, ORC, and Parquet.
Dynamic partitioning allows automatic creation of partitions based on data.
Q20. Advantages and disadvantages of Hive?
Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
Advantages: SQL-like query language for querying large datasets, optimized for OLAP workloads, supports partitioning and bucketing for efficient queries.
Disadvantages: Slower performance compared to traditional databases for OLTP workloads, limited support for complex queries and transactions.
Example: Hive can be used to analyze large volumes of log data to ext...read more
Top Interview Questions for Related Skills
Interview Questions of Hive Related Designations
Interview experiences of popular companies
Reviews
Interviews
Salaries
Users/Month