Data Engineer
900+ Data Engineer Interview Questions and Answers
Q51. Hive types of tables and difference between them
Hive has two types of tables: managed (internal) and external. For managed tables Hive owns both the metadata and the data; for external tables Hive owns only the metadata.
Managed tables are created with the 'CREATE TABLE' command, and their data is stored in Hive's warehouse directory.
External tables are created with the 'CREATE EXTERNAL TABLE' command, and their data lives outside Hive's warehouse directory at a location you specify.
Dropping a managed table deletes both the metadata and the data; dropping an external table deletes only the metadata and leaves the files in place.
Hive has full control over the lifecycle of managed tables, while external tables suit data that is shared with other tools (a minimal example follows).
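A minimal sketch of the difference in Spark SQL with Hive support; the table names and the '/data/ext/sales' location are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Managed table: Hive owns both metadata and data (stored in the warehouse directory).
    spark.sql("CREATE TABLE IF NOT EXISTS sales_managed (id INT, amount DOUBLE)")

    # External table: Hive owns only the metadata; the files stay at the given location.
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS sales_external (id INT, amount DOUBLE)
        LOCATION '/data/ext/sales'
    """)

    # DROP removes data and metadata for the managed table, but only metadata for the external one.
    spark.sql("DROP TABLE sales_managed")
    spark.sql("DROP TABLE sales_external")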
Q52. Methods of migrating Hive metastore to Unity Catalog in Databricks?
Use Databricks-provided tools such as databricks-connect and databricks-cli to migrate Hive metastore metadata to Unity Catalog.
Use databricks-connect to connect to the Databricks workspace from your local development environment.
Use databricks-cli to export the Hive metadata from the existing Hive metastore.
Create a new Unity catalog in Databricks and import the exported metadata using databricks-cli.
Validate the migration by checking the tables and databases in the Unity catalog.
Q53. Design the generic tool or package using pyspark which allows to create connections to multiple databases like mysql, s3 or api. Fetch the result and do transformations like handling null values and then store ...
Design a generic tool in PySpark that connects to multiple sources, fetches results, handles null values, and stores the output in another database.
Use PySpark to create a tool that can connect to sources like MySQL, S3, or an API.
Implement functions to fetch data from the databases and perform transformations like handling null values
Utilize pyspark to store the transformed data in another database
Consider using pyspark SQL functions for data transformations
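A rough sketch of such a tool; the connection options (JDBC URL, S3 path, credentials, table names) are placeholders:

    from pyspark.sql import SparkSession, DataFrame

    spark = SparkSession.builder.appName("generic_ingest").getOrCreate()

    def read_source(source_type: str, options: dict) -> DataFrame:
        # Dispatch on the source type; each branch returns a DataFrame.
        if source_type == "mysql":
            return (spark.read.format("jdbc")
                    .option("url", options["url"])            # e.g. jdbc:mysql://host:3306/db
                    .option("dbtable", options["table"])
                    .option("user", options["user"])
                    .option("password", options["password"])
                    .load())
        if source_type == "s3":
            return spark.read.json(options["path"])           # e.g. s3a://bucket/prefix/
        raise ValueError(f"Unsupported source type: {source_type}")

    def clean_nulls(df: DataFrame) -> DataFrame:
        # Example transformation: fill numeric nulls with 0 and drop rows that are entirely null.
        return df.fillna(0).dropna(how="all")

    def write_target(df: DataFrame, jdbc_url: str, table: str, props: dict) -> None:
        # Store the transformed result in another database over JDBC.
        df.write.jdbc(jdbc_url, table, mode="append", properties=props)

An API source could be added by fetching JSON with requests and calling spark.createDataFrame on the parsed records.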
Q54. Convert a string of multiple lines with 'n' words to multiple arrays of fixed size k, with no overlap of elements across arrays.
Convert a string of multiple lines with 'n' words to multiple arrays of fixed size without overlap.
Split the string into individual words
Create arrays of fixed size 'k' and distribute words evenly
Handle cases where the number of words is not divisible by 'k'
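A small Python sketch; the last chunk is simply shorter when the word count is not divisible by k:

    def chunk_words(text: str, k: int) -> list:
        # Split the multi-line string into words, then slice into chunks of size k with no overlap.
        words = text.split()
        return [words[i:i + k] for i in range(0, len(words), k)]

    # 5 words with k=2 -> [['a', 'b'], ['c', 'd'], ['e']]
    print(chunk_words("a b\nc d e", 2))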
Q55. 4) How to read json data using spark
To read JSON data using Spark, use the SparkSession.read.json() method.
Create a SparkSession object
Use the read.json() method to read the JSON data
Specify the path to the JSON file or directory containing JSON files
The resulting DataFrame can be manipulated using Spark's DataFrame API
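A minimal example; the file paths are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read a JSON file (or a directory of JSON files) into a DataFrame.
    df = spark.read.json("/data/events.json")

    # For pretty-printed JSON records that span multiple lines, enable the multiLine option.
    df_multiline = spark.read.option("multiLine", True).json("/data/events_pretty.json")

    df.printSchema()
    df.show(5)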
Q56. 1. What is columnar storage, Parquet, Delta? Why is it used?
Columnar storage is a data storage format that stores data in columns rather than rows, improving query performance.
Columnar storage stores data in a column-wise manner instead of row-wise.
It improves query performance by reducing the amount of data that needs to be read from disk.
Parquet is a columnar storage file format that is optimized for big data workloads.
It is used in Apache Spark and other big data processing frameworks.
Delta Lake is an open-source storage layer that provides ACID transactions, schema enforcement, and time travel on top of Parquet files.
Q57. What is the Difference between Transformation and Actions in pyspark? And Give Example
Transformations in PySpark are lazily evaluated, while actions trigger the execution of those transformations.
Transformations are operations that are not executed immediately but create a plan for execution.
Actions are operations that trigger the execution of transformations and return results.
Examples of transformations include map, filter, and reduceByKey.
Examples of actions include collect, count, and saveAsTextFile.
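A short illustration of the lazy/eager split on an RDD:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

    # Transformations only build a lineage; nothing runs yet.
    doubled = rdd.map(lambda x: x * 2)
    multiples_of_four = doubled.filter(lambda x: x % 4 == 0)

    # Actions trigger execution of the whole lineage and return results to the driver.
    print(multiples_of_four.collect())   # [4, 8]
    print(multiples_of_four.count())     # 2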
Q58. Given a dictionary, find out the greatest number for the same key in Python.
Find the greatest number for the same key in a Python dictionary (a small sketch follows below).
Use max() function with key parameter to find the maximum value for each key in the dictionary.
Iterate through the dictionary and apply max() function on each key.
If the dictionary is nested, use recursion to iterate through all the keys.
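The question is ambiguous; assuming each key maps to a list of numbers, a small sketch:

    scores = {"a": [3, 7, 2], "b": [10, 4], "c": [5]}

    # Greatest number for each key.
    greatest_per_key = {k: max(v) for k, v in scores.items()}
    print(greatest_per_key)                        # {'a': 7, 'b': 10, 'c': 5}

    # Key that holds the overall greatest value.
    best_key = max(scores, key=lambda k: max(scores[k]))
    print(best_key)                                # 'b'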
Q59. SQL query for getting 2nd highest salary from each department
SQL query to retrieve the second highest salary from each department
Use the RANK() function to assign a rank to each salary within each department
Filter the results to only include rows with a rank of 2
Group the results by department to get the second highest salary for each department
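One way to write it, assuming a hypothetical employees table (department, salary) registered as a view:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    second_highest = spark.sql("""
        SELECT department, salary
        FROM (
            SELECT department,
                   salary,
                   DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rnk
            FROM employees
        ) ranked
        WHERE rnk = 2
    """)
    second_highest.show()

DENSE_RANK() is used here so that ties on the top salary do not skip the second-highest value.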
Q60. Difference between RDD, Dataframe, Dataset.
RDD, Dataframe, and Dataset are data structures in Apache Spark with different characteristics and functionalities.
RDD (Resilient Distributed Datasets) is a fundamental data structure in Spark that represents an immutable distributed collection of objects. It provides low-level APIs for distributed data processing and fault tolerance.
Dataframe is a distributed collection of data organized into named columns. It is similar to a table in a relational database and provides a higher-level API optimized by the Catalyst engine. Dataset adds compile-time type safety on top of the DataFrame API and is available in Scala and Java.
Q61. What is the probability that you can cut a rope into exactly two halves?
The probability of cutting a rope into exactly two equal halves is zero.
The cut can land anywhere along the rope, a continuous range of possible positions.
The exact midpoint is a single point within that continuum, so the event has probability (measure) zero.
Therefore, the probability of cutting the rope into exactly two halves is zero.
Q62. What is a Common Table Expression (CTE)? How is a CTE different from a Stored Procedure?
CTE is a temporary result set that can be referenced within a SELECT, INSERT, UPDATE, or DELETE statement. It is different from a Stored Procedure as it is only available for the duration of the query.
CTE stands for Common Table Expression and is defined using the WITH keyword.
CTEs are mainly used for recursive queries, complex joins, and simplifying complex queries.
CTEs are not stored in the database like Stored Procedures; they exist only for the duration of the query execution.
Q63. what if you have to find out second highest transacting member in each city?
Use SQL query with window function to rank members by transaction amount in each city.
Use SQL query with PARTITION BY clause to group members by city
Use ORDER BY clause to rank members by transaction amount
Select the second highest member for each city
Q64. Python: sort elements of a string array by alphabet weightage and resolve all 23 test cases that nobody can.
Sort string array elements by alphabet weightage in Python and pass 23 test cases.
Use the sorted() function with key parameter to sort elements by weightage
Define a function to calculate weightage of each character
Test the function with various test cases to ensure accuracy
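'Weightage' is not defined in the question; assuming it means the sum of each letter's position in the alphabet (a=1 ... z=26), a sketch:

    def weightage(word: str) -> int:
        # Sum of alphabet positions, case-insensitive, ignoring non-letter characters.
        return sum(ord(ch) - ord('a') + 1 for ch in word.lower() if ch.isalpha())

    words = ["bad", "ace", "cab"]
    # Sort by weightage, breaking ties alphabetically.
    print(sorted(words, key=lambda w: (weightage(w), w)))   # ['cab', 'bad', 'ace']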
Q65. RDDs vs DataFrames. Which is better and why
DataFrames are better than RDDs due to their optimized performance and ease of use.
DataFrames are optimized for better performance than RDDs.
DataFrames have a schema, making it easier to work with structured data.
DataFrames support SQL queries and can be used with Spark SQL.
RDDs are more low-level and require more manual optimization.
RDDs are useful for unstructured data or when fine-grained control is needed.
Q66. Connecting Spark to Azure SQL Database.
Spark can connect to Azure SQL Database using JDBC driver.
Download and install the JDBC driver for Azure SQL Database.
Set up the connection string with the appropriate credentials.
Use the JDBC API to connect Spark to Azure SQL Database.
Example: val df = spark.read.jdbc(jdbcUrl, tableName, connectionProperties)
Ensure that the firewall rules for the Azure SQL Database allow access from the Spark cluster.
Q67. How to delete duplicate rows from a table
To delete duplicate rows from a table, either select the distinct rows into a new table or delete with a subquery that keeps one row per duplicate group (a PySpark sketch follows below).
Use the DISTINCT keyword to select unique rows from the table.
Use the GROUP BY clause to group the rows by a specific column and select the unique rows.
Use the DELETE statement with a subquery to delete the duplicate rows.
Create a new table with the unique rows and drop the old table.
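The 'new table with unique rows' approach sketched in PySpark, assuming a hypothetical orders table; in plain SQL the same idea is a DELETE with a ROW_NUMBER() subquery that keeps one row per group:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    orders = spark.table("orders")

    # Keep one row per fully duplicated record (pass a column list to dedupe on a business key instead).
    deduped = orders.dropDuplicates()

    # Write the unique rows to a new table, which can then replace the old one.
    deduped.write.mode("overwrite").saveAsTable("orders_deduped")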
Q68. About ETL - What do you know about it and what are fundamental factors to be considered while working on any ETL tool.
ETL stands for Extract, Transform, Load. It is a process of extracting data from various sources, transforming it, and loading it into a target system.
ETL is used to integrate data from different sources into a unified format.
The fundamental factors to consider while working on any ETL tool include data extraction, data transformation, and data loading.
Data extraction involves retrieving data from various sources such as databases, files, APIs, etc.
Data transformation involves cleaning, filtering, aggregating, and converting the data into the format required by the target system.
Q69. What are all the issues you faced in your project? What is Global Parameter? Why do we need parameters inADF? What are the API's in Spark?
Answering questions related to data engineering
Issues faced in project: data quality, scalability, performance, integration
Global parameter: a parameter that can be accessed across multiple components in a system
Parameters in ADF: used to pass values between activities in a pipeline
APIs in Spark: Spark SQL, Spark Streaming, MLlib, GraphX
Q70. Difference between the interactive cluster and job cluster ?
Interactive clusters allow for real-time interaction and exploration, while job clusters are used for running batch jobs.
Interactive clusters are used for real-time data exploration and analysis.
Job clusters are used for running batch jobs and processing large amounts of data.
Interactive clusters are typically smaller in size and have shorter lifespans.
Job clusters are usually larger and more powerful to handle heavy workloads.
Examples: interactive clusters can be used for ad-hoc notebook analysis, while job clusters run scheduled ETL pipelines.
Q71. 2) Difference between partitioning and Bucketing
Partitioning is dividing data into smaller chunks based on a column value. Bucketing is dividing data into equal-sized buckets based on a hash function.
Partitioning is used for organizing data for efficient querying and processing.
Bucketing is used for evenly distributing data across nodes in a cluster.
Partitioning is done based on a column value, such as date or region.
Bucketing is done by hashing the bucket column's value and taking it modulo the number of buckets.
Partitioning can improve query performance through partition pruning, while bucketing helps with joins and sampling.
Q72. Tell me about a data engineering challenge you faced. How did you tackle it and what was the outcome?
Migrating data from on-premise servers to cloud storage
Identified data sources and destination in cloud storage
Developed ETL pipelines to extract, transform, and load data
Ensured data integrity and security during migration process
Monitored and optimized performance of data transfer
Collaborated with cross-functional teams for successful migration
Q73. What are the window functions you have used?
Window functions are used to perform calculations across a set of rows that are related to the current row.
Commonly used window functions include ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, FIRST_VALUE, LAST_VALUE, and NTILE.
Window functions are used in conjunction with the OVER clause to define the window or set of rows to perform the calculation on.
Window functions can be used to calculate running totals, moving averages, and other aggregate calculations.
Window functions are supported in standard SQL as well as in Spark SQL and PySpark (a PySpark example follows below).
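A running-total example with PySpark window functions, on a small hypothetical sales dataset:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()
    sales = spark.createDataFrame(
        [("east", "2024-01-01", 100), ("east", "2024-01-02", 50), ("west", "2024-01-01", 80)],
        ["region", "day", "amount"],
    )

    w = Window.partitionBy("region").orderBy("day")

    # ROW_NUMBER, a running SUM, and LAG over the same window.
    result = (sales
              .withColumn("row_num", F.row_number().over(w))
              .withColumn("running_total", F.sum("amount").over(w))
              .withColumn("prev_amount", F.lag("amount").over(w)))
    result.show()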
Q74. Which techniques you would have used on a system you have worked in the past to be able to optimise it further. (Deep knowledge of the databases is required)
I would have used indexing, query optimization, and data partitioning to optimize the system.
Implement indexing on frequently queried columns to improve search performance.
Optimize queries by using proper joins, filters, and aggregations.
Partition large tables to distribute data across multiple storage devices for faster access.
Use materialized views to precompute and store aggregated data for quicker retrieval.
Q75. How would you monitor an overnight data load job in snowflake
Monitor overnight data load job in Snowflake
Set up alerts and notifications for job completion or failure
Check job logs for any errors or issues
Monitor resource usage during the data load process
Use Snowflake's query history to track job progress
Implement automated retries in case of failures
Q76. How to achieve AWS cross-account sharing?
AWS cross account sharing can be achieved by using IAM roles and policies.
Create an IAM role in the account that will share resources
Define a trust policy in the role to allow the other account to assume the role
Attach a policy to the role granting the necessary permissions
In the receiving account, create an IAM role with a trust policy allowing the sharing account to assume the role
Use the AWS CLI or console to assume the role and access the shared resources
Q77. How do you handle null values in PySpark?
Null values in PySpark are handled using functions such as dropna(), fillna(), and replace().
dropna() function is used to drop rows or columns with null values
fillna() function is used to fill null values with a specified value or method
replace() function is used to replace null values with a specified value
coalesce() function is used to replace null values with the first non-null value in a list of columns
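A few of those calls together on a small hypothetical DataFrame:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, 10.0, "NY"), (2, None, None), (3, 20.0, "SF")],
        ["id", "price", "city"],
    )

    df.dropna(subset=["price"]).show()                              # drop rows where price is null
    df.fillna({"price": 0.0, "city": "unknown"}).show()             # fill per-column defaults
    df.withColumn("city", F.coalesce("city", F.lit("n/a"))).show()  # first non-null value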
Q78. How will you Join if two tables are large in pyspark?
Use a broadcast join when one side is small enough, or a shuffle join with repartitioning when both tables are large.
If one table fits in executor memory, broadcast it; if both are large, repartition both on the join key so Spark performs a sort-merge join.
Broadcast join - the smaller table is copied to every worker node, avoiding a shuffle of the large table.
Partition join - repartition both tables on the join key so matching keys land on the same partitions, then join.
Example: df1.join(broadcast(df2), 'join_key')
Example: df1.join(df2, 'join_key').repartition('join_key')
Q79. Static allocation in spark. 10TB of file needs to be processed in spark, what configuration (executors and cores) would you choose and why?
For processing 10TB of file in Spark, consider allocating multiple executors with sufficient cores to maximize parallel processing.
Allocate multiple executors to handle the large file size efficiently
Determine the optimal number of cores per executor based on the available resources and workload
Consider the memory requirements for each executor to avoid out-of-memory errors
Adjust the configuration based on the specific requirements of the job and cluster setup
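A hedged starting point only, since the right numbers depend on cluster capacity: with ~128 MB input partitions, 10 TB is roughly 80,000 tasks, so one might begin with a few hundred concurrent task slots and tune from there. The figures below are illustrative, not a prescription:

    from pyspark.sql import SparkSession

    # ~50 executors x 5 cores = ~250 concurrent tasks; ~20g memory per executor.
    spark = (SparkSession.builder
             .appName("large_file_job")
             .config("spark.executor.instances", "50")
             .config("spark.executor.cores", "5")
             .config("spark.executor.memory", "20g")
             .config("spark.sql.shuffle.partitions", "2000")
             .getOrCreate())

Around five cores per executor keeps I/O throughput healthy, and more executors can be added if the cluster has spare capacity.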
Q80. 1) What is the internal mechanism in Spark? 2) Explain the Tungsten project in Spark. 3) SQL problem to check whether the last two transactions belong to a particular retailer.
Questions related to Spark internals, Tungsten project, and SQL problem for retail transactions.
Spark's internal mechanism includes components like Spark Core, Spark SQL, Spark Streaming, and MLlib.
Tungsten project in Spark aims to improve the performance of Spark by optimizing memory usage and CPU utilization.
To solve the SQL problem, we can use a query to filter transactions for a particular retailer and then use the 'ORDER BY' clause to sort them by date and time. We can then take the two most recent rows with LIMIT 2 or a window function such as ROW_NUMBER().
Q81. What is a view in SQL, and what is dense rank?
View is a virtual table created from a SQL query. Dense rank assigns a unique rank to each row in a result set.
A view is a saved SQL query that can be used as a table
Dense rank assigns a unique rank to each row in a result set, with no gaps between the ranks
Dense rank is used to rank rows based on a specific column or set of columns
Example: SELECT * FROM my_view WHERE column_name = 'value'
Example: SELECT column_name, DENSE_RANK() OVER (ORDER BY column_name) FROM my_table
Q82. What Volume of data have you handled in your POCs ?
I have handled terabytes of data in my POCs, including data from various sources and formats.
Handled terabytes of data in POCs
Worked with data from various sources and formats
Used tools like Hadoop, Spark, and SQL for data processing
Q83. SQL question to return the shortest duration of flight to travel from NY to HND
Use SQL query to find shortest flight duration from NY to HND
Use SQL query with MIN function to find shortest duration
Filter flights from NY to HND using WHERE clause
Calculate the duration by subtracting the departure time from the arrival time.
Q84. There are four cores and four worker nodes in Spark. How many jobs will run in parallel?
With four worker nodes and four cores in total, up to four tasks can run in parallel; by default only one job runs at a time.
In Spark, each core runs one task at a time, so total parallelism is bounded by the number of cores available to the executors.
A job is broken into stages and tasks, and its tasks fill the four available cores.
Jobs submitted from the same driver are scheduled FIFO by default, so one job runs at a time unless the fair scheduler and multiple submitting threads are used.
Q85. What are the optimisation techniques you have used in your project ?
I have used techniques like indexing, query optimization, and parallel processing in my projects.
Indexing: Used to improve the speed of data retrieval by creating indexes on columns frequently used in queries.
Query optimization: Rewriting queries to improve efficiency and reduce execution time.
Parallel processing: Distributing tasks across multiple processors to speed up data processing.
Caching: Storing frequently accessed data in memory to reduce the need for repeated retrieval from disk.
Q86. What are the different AWS Data Analytics services used in your project and explanation for why each service was used? What are the alternate services available and why they were not used in the project? Questi...
AWS Data Analytics services used in the project, alternate services, and RDBMS concepts.
AWS Data Analytics services used: Amazon Redshift for data warehousing, Amazon EMR for big data processing, Amazon Athena for interactive querying
Explanation for usage: Redshift for storing and analyzing large datasets, EMR for processing and analyzing big data, Athena for ad-hoc querying
Alternate services not used: Amazon RDS for relational database management, Amazon Kinesis for real-time data streaming.
Q87. Design a business case to use a self join. Condition: do not use a hierarchical use case like teacher-student, employee-manager, or father-grandfather.
Using self join to analyze customer behavior in an e-commerce platform.
Identifying patterns in customer purchase history
Analyzing customer preferences based on past purchases
Segmenting customers based on their buying behavior
Q88. What is normalization in SQL? Explain 1NF, 2NF, 3NF.
Normalization in SQL is the process of organizing data in a database to reduce redundancy and improve data integrity.
1NF (First Normal Form) - Each column in a table must contain atomic values, and there should be no repeating groups.
2NF (Second Normal Form) - The table should be in 1NF and all non-key attributes must be fully functionally dependent on the primary key.
3NF (Third Normal Form) - The table should be in 2NF and there should be no transitive dependencies between non-key attributes.
Q89. What are the tools you used for data engineering?
Tools used for data engineering include ETL tools, programming languages, databases, and cloud platforms.
ETL tools like Apache NiFi, Talend, and Informatica are used for data extraction, transformation, and loading.
Programming languages like Python, Java, and Scala are used for data processing and analysis.
Databases like MySQL, PostgreSQL, and MongoDB are used for storing and managing data.
Cloud platforms like AWS, Azure, and Google Cloud provide scalable infrastructure for data storage and processing.
Q90. Write function to check if number is an Armstrong Number
Function to check if a number is an Armstrong Number
An Armstrong Number is a number that is equal to the sum of its own digits raised to the power of the number of digits
To check if a number is an Armstrong Number, we need to calculate the sum of each digit raised to the power of the number of digits
If the sum is equal to the original number, then it is an Armstrong Number
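A small Python version:

    def is_armstrong(n: int) -> bool:
        # Sum of each digit raised to the power of the number of digits.
        digits = str(n)
        power = len(digits)
        return n == sum(int(d) ** power for d in digits)

    print(is_armstrong(153))    # True: 1^3 + 5^3 + 3^3 = 153
    print(is_armstrong(9474))   # True: 9^4 + 4^4 + 7^4 + 4^4 = 9474
    print(is_armstrong(100))    # False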
Q91. 3) Difference between cache and persistent storage
Cache is temporary storage used to speed up access to frequently accessed data. Persistent storage is permanent storage used to store data even after power loss.
Cache is faster but smaller than persistent storage
Cache is volatile and data is lost when power is lost
Persistent storage is non-volatile and data is retained even after power loss
Examples of cache include CPU cache, browser cache, and CDN cache
Examples of persistent storage include hard disk drives and solid-state drives (SSDs).
Q92. What is IR - integration Runtime? what are the types of IR
Integration Runtime (IR) is a compute infrastructure that provides data integration capabilities across different network environments.
IR is used in Azure Data Factory to provide data integration capabilities
There are three types of IR: Azure, Self-hosted, and Azure-SSIS
Azure IR is fully managed by Microsoft and is used for data movement in the cloud
Self-hosted IR allows data movement between on-premises and cloud data stores
Azure-SSIS IR is used for running SQL Server Integration Services (SSIS) packages in Azure Data Factory.
Q93. What is imputer function in PySpark
Imputer function in PySpark is used to replace missing values in a DataFrame.
Imputer is a transformer in PySpark ML library.
It replaces missing values in a DataFrame with the mean, median, or mode of the column.
It works on numeric columns; categorical columns are usually handled separately, for example with fillna() or a default category.
Example: imputer = Imputer(inputCols=['col1', 'col2'], outputCols=['col1_imputed', 'col2_imputed'], strategy='mean')
Example: imputed_df = imputer.fit(df).transform(df)
Q94. How to handle duplicates in python ?
Use Python's built-in data structures like sets or dictionaries to handle duplicates.
Use a set to remove duplicates from a list: unique_list = list(set(original_list))
Use a dictionary to remove duplicates from a list while preserving order: unique_list = list(dict.fromkeys(original_list))
Q95. What are action and transformation ?
Actions and transformations are key concepts in data engineering, involving the manipulation and processing of data.
Actions are operations that trigger the execution of a data transformation job in a distributed computing environment.
Transformations are functions that take an input dataset and produce an output dataset, often involving filtering, aggregating, or joining data.
Examples of actions include 'saveAsTextFile' in Apache Spark, which saves the RDD to a text file, and 'collect', which returns the results to the driver.
Q96. 1) Optimizations techniques used while working on Spark and hive. 2) difference between partitioning and bucketing 3) How to add column in Data frame 4) Difference between cache and persistent
Answers to questions related to Spark, Hive, and Data Frames
Optimization techniques in Spark and Hive include partitioning, bucketing, and caching
Partitioning divides data into smaller, more manageable chunks while bucketing groups data based on a specific column
Adding a column to a Data Frame can be done using the 'withColumn' method
Caching stores data in memory for faster access while persistence stores data on disk for durability
Q97. Write SQL code to get the city1-city2 distance from a table where city1 and city2 pairs can repeat.
SQL code to get the city1-city2 distance from a table with repeating city1 and city2 values.
Use a self join on the table to match city1 and city2
Calculate the distance between the cities using appropriate formula
Consider using a subquery if needed
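One interpretation, assuming a hypothetical routes view with columns (city1, city2, distance) where the same pair may appear in either order; the self join keeps a single row per pair:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    pair_distance = spark.sql("""
        SELECT DISTINCT r1.city1, r1.city2, r1.distance
        FROM routes r1
        LEFT JOIN routes r2
          ON r1.city1 = r2.city2 AND r1.city2 = r2.city1
        WHERE r2.city1 IS NULL OR r1.city1 < r1.city2
    """)
    pair_distance.show()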
Q98. Difference between logical plan and physical plan in pyspark?
Logical plan represents the high-level abstract representation of the computation to be performed, while physical plan represents the actual execution plan with specific optimizations and details.
Logical plan is a high-level abstract representation of the computation to be performed.
Physical plan is the actual execution plan with specific optimizations and details.
Logical plan is created first and then optimized to generate the physical plan.
The physical plan includes details like the chosen join strategy, shuffle exchanges, and file scan operators.
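Both plans can be inspected with explain(); a quick sketch:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1000).withColumn("bucket", F.col("id") % 10)
    agg = df.groupBy("bucket").count()

    # explain(True) prints the parsed, analyzed, and optimized logical plans plus the physical plan;
    # the physical plan shows concrete operators such as HashAggregate and Exchange (the shuffle).
    agg.explain(True)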
Q99. What is AFD? Build dynamic pipeline, Spark architecture, SQL data flow.
AFD is not a commonly used term in data engineering. Can you provide more context?
Q100. Python dataframes, how we use them in a project, and where and when.
Python dataframes are used to organize and manipulate data in a tabular format.
Dataframes are created using the pandas library in Python.
They allow for easy manipulation of data, such as filtering, sorting, and grouping.
Dataframes can be used in various projects, such as data analysis, machine learning, and data visualization.
Examples of using dataframes include analyzing sales data, predicting customer behavior, and visualizing stock market trends.
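A tiny pandas example of those operations:

    import pandas as pd

    sales = pd.DataFrame({
        "region": ["east", "east", "west"],
        "product": ["a", "b", "a"],
        "amount": [100, 50, 80],
    })

    # Filtering, sorting, and grouping in a tabular workflow.
    east_only = sales[sales["region"] == "east"]
    ranked = sales.sort_values("amount", ascending=False)
    total_by_region = sales.groupby("region")["amount"].sum()
    print(total_by_region)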