Data Engineer
1000+ Data Engineer Interview Questions and Answers
Q51. Hive types of tables and difference between them
Hive has two types of tables: managed (internal) and external. For managed tables Hive owns both the metadata and the data; for external tables Hive owns only the metadata.
Managed tables are created with the CREATE TABLE command, and their data is stored in Hive's warehouse directory.
External tables are created with the CREATE EXTERNAL TABLE command, and their data is stored outside the warehouse directory at a user-specified LOCATION.
Dropping a managed table deletes both the metadata and the data; dropping an external table deletes only the metadata and leaves the underlying files in place.
Hive has full control over the lifecycle of managed tables, which makes them suitable for data that Hive alone produces and consumes.
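A minimal DDL sketch of the difference (table names, columns, and the LOCATION path are illustrative; assumes a SparkSession with Hive support):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Managed table: Hive owns metadata and data (stored under the warehouse directory).
spark.sql("CREATE TABLE IF NOT EXISTS sales_managed (id INT, amount DOUBLE)")

# External table: Hive owns only the metadata; the data stays at the given LOCATION.
spark.sql("""
CREATE EXTERNAL TABLE IF NOT EXISTS sales_external (id INT, amount DOUBLE)
LOCATION '/data/external/sales'
""")

# DROP TABLE removes the data for the managed table, but only the metadata for the external one.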
Q52. Methods of migrating the Hive metastore to Unity Catalog in Databricks?
Use Databricks provided tools like databricks-connect and databricks-cli to migrate Hive metadata to Unity catalog.
Use databricks-connect to connect to the Databricks workspace from your local development environment.
Use databricks-cli to export the Hive metadata from the existing Hive metastore.
Create a new Unity catalog in Databricks and import the exported metadata using databricks-cli.
Validate the migration by checking the tables and databases in the Unity catalog.
Q53. Design the generic tool or package using pyspark which allows to create connections to multiple databases like mysql, s3 or api. Fetch the result and do transformations like handling null values and then store...
Design a generic PySpark tool that connects to multiple sources, fetches results, handles null values, and stores the output in another database
Use PySpark to create a tool that can connect to sources such as MySQL, S3, or a REST API
Implement functions to fetch data from each source and perform transformations such as handling null values
Use PySpark to write the transformed data to the target database
Consider using PySpark SQL functions for the data transformations, as in the sketch below
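A minimal sketch of such a tool (the function names read_source, clean_nulls, and write_jdbc, and all connection options, are illustrative assumptions rather than a fixed design):

from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.appName("generic-connector").getOrCreate()

def read_source(source_type: str, options: dict) -> DataFrame:
    # Read from MySQL (JDBC), S3 (files), or a REST API depending on source_type.
    if source_type == "mysql":
        return spark.read.format("jdbc").options(**options).load()
    if source_type == "s3":
        return spark.read.format(options.get("format", "parquet")).load(options["path"])
    if source_type == "api":
        import requests  # assumes the API returns a JSON array of records
        return spark.createDataFrame(requests.get(options["url"]).json())
    raise ValueError(f"unsupported source: {source_type}")

def clean_nulls(df: DataFrame, defaults: dict) -> DataFrame:
    # Drop fully-null rows, then fill remaining nulls with per-column defaults.
    return df.dropna(how="all").fillna(defaults)

def write_jdbc(df: DataFrame, options: dict) -> None:
    # Store the transformed result in the target database over JDBC.
    df.write.format("jdbc").options(**options).mode("append").save()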
Q54. Convert a string of multiple lines with 'n' words into multiple arrays of fixed size k, with no overlap of elements across arrays.
Convert a string of multiple lines with 'n' words to multiple arrays of fixed size without overlap.
Split the string into individual words
Fill consecutive arrays of size 'k' in order, so no element appears in more than one array
Handle the case where the number of words is not divisible by 'k' (the last array holds the remainder), as in the sketch below
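A short sketch of the chunking logic (function name and sample text are illustrative):

def chunk_words(text: str, k: int) -> list:
    words = text.split()  # splits on any whitespace, including newlines
    return [words[i:i + k] for i in range(0, len(words), k)]

print(chunk_words("one two three\nfour five", 2))
# [['one', 'two'], ['three', 'four'], ['five']] -- the last chunk holds the remainder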
Q55. 4) How to read json data using spark
To read JSON data using Spark, use the SparkSession.read.json() method.
Create a SparkSession object
Use the read.json() method to read the JSON data
Specify the path to the JSON file or directory containing JSON files
The resulting DataFrame can be manipulated using Spark's DataFrame API
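A minimal sketch (the path is illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-json").getOrCreate()

# multiLine=True is needed when a file holds one pretty-printed JSON document
# instead of one JSON object per line (the default expectation).
df = spark.read.json("/data/events/", multiLine=False)
df.printSchema()
df.show(5)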
Q56. 1. What is columnar storage, Parquet, Delta? Why is it used?
Columnar storage is a data storage format that stores data in columns rather than rows, improving query performance.
Columnar storage stores data in a column-wise manner instead of row-wise.
It improves query performance by reducing the amount of data that needs to be read from disk.
Parquet is a columnar storage file format that is optimized for big data workloads.
It is used in Apache Spark and other big data processing frameworks.
Delta (Delta Lake) is an open-source storage layer that provides ACID transactions, schema enforcement, and time travel on top of Parquet files; it is used in Databricks and open-source Spark.
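A short sketch of writing the same DataFrame as Parquet and as Delta (paths are illustrative; the "delta" format assumes the Delta Lake package is on the cluster, and spark is an existing SparkSession):

df = spark.range(1000).withColumnRenamed("id", "user_id")

df.write.mode("overwrite").parquet("/tmp/users_parquet")              # columnar Parquet files
df.write.format("delta").mode("overwrite").save("/tmp/users_delta")   # Parquet files plus a transaction log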
Q57. What is the Difference between Transformation and Actions in pyspark? And Give Example
Transformations in PySpark are evaluated lazily, while actions trigger the execution of the accumulated transformations.
Transformations are operations that are not executed immediately but create a plan for execution.
Actions are operations that trigger the execution of transformations and return results.
Examples of transformations include map, filter, and reduceByKey.
Examples of actions include collect, count, and saveAsTextFile.
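A small sketch showing the laziness (assumes an existing SparkSession named spark):

rdd = spark.sparkContext.parallelize(range(10))

doubled = rdd.map(lambda x: x * 2)             # transformation: nothing runs yet
evens = doubled.filter(lambda x: x % 4 == 0)   # transformation: still lazy

print(evens.count())    # action: the whole pipeline executes now
print(evens.collect())  # action: results are returned to the driver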
Q58. Given a dictionary, find out the greatest number for same key in Python.
Find the greatest number for same key in a Python dictionary.
Assuming each key maps to a collection of numbers, apply max() to each key's values to get the greatest number per key.
Iterate over dict.items() and build a new mapping of key to max(values), for example with a dict comprehension.
If the dictionary is nested, recurse into the inner dictionaries; a sketch follows below.
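A minimal sketch, assuming each key maps to a list of numbers (that interpretation, and the sample data, are assumptions):

scores = {"a": [3, 9, 1], "b": [7, 2], "c": [5]}
greatest = {key: max(values) for key, values in scores.items()}
print(greatest)  # {'a': 9, 'b': 7, 'c': 5}

# If the same key instead appears across many (key, value) records,
# collections.defaultdict(list) can accumulate the values before taking max().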
Q59. Why and when to use Generators and decorators in python?
Generators are used to create iterators, while decorators are used to modify functions or methods.
Generators are used to generate a sequence of values lazily, saving memory and improving performance.
Decorators are used to add functionality to existing functions or methods without modifying their code.
Generators are useful when dealing with large datasets or infinite sequences.
Decorators can be used for logging, caching, authentication, and more.
Example of a generator: a function that uses yield to produce values one at a time (see the sketch below).
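A short sketch of both (names are illustrative):

import functools

def countdown(n):
    # Generator: yields values one at a time instead of building a full list in memory.
    while n > 0:
        yield n
        n -= 1

def log_calls(func):
    # Decorator: wraps a function to print its name before each call.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        print(f"calling {func.__name__}")
        return func(*args, **kwargs)
    return wrapper

@log_calls
def add(a, b):
    return a + b

print(list(countdown(3)))  # [3, 2, 1]
print(add(2, 5))           # prints "calling add", then 7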
Q60. SQL query for getting 2nd highest salary from each department
SQL query to retrieve the second highest salary from each department
Use the DENSE_RANK() window function, partitioned by department and ordered by salary descending
Filter the results to rows with a rank of 2 (DENSE_RANK, unlike RANK, does not skip the second rank when top salaries tie)
Because the rank is computed within each department partition, this returns the second highest salary per department; a sketch follows below
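A sketch of the query (the table name employees and the columns department and salary are assumptions; runnable via spark.sql or any SQL engine with window functions):

query = """
SELECT department, salary
FROM (
    SELECT department, salary,
           DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rnk
    FROM employees
) ranked
WHERE rnk = 2
"""
second_highest = spark.sql(query)  # assumes 'employees' is registered as a table or view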
Q61. Difference between RDD, Dataframe, Dataset.
RDD, Dataframe, and Dataset are data structures in Apache Spark with different characteristics and functionalities.
RDD (Resilient Distributed Datasets) is a fundamental data structure in Spark that represents an immutable distributed collection of objects. It provides low-level APIs for distributed data processing and fault tolerance.
Dataframe is a distributed collection of data organized into named columns. It is similar to a table in a relational database and provides a higher-level API that is optimized by the Catalyst query optimizer. Dataset combines the type safety of RDDs with the optimizations of DataFrames; it is available in Scala and Java, while in Python the DataFrame is the equivalent.
Q62. What is the probability that you can cut a rope into exactly two halves?
Treated as a continuous problem, the probability of cutting a rope into exactly two equal halves is zero.
The cut position is a continuous random variable, and the exact midpoint is a single point of zero width, so it is hit with probability zero.
In practice, blade thickness and measurement error also make a mathematically exact half impossible to verify.
Therefore the probability of an exactly even cut is zero, even though cuts arbitrarily close to the midpoint are possible.
Q63. what is Common Expression Query (CTE)?How CTE is different from Stored Procedure?
CTE is a temporary result set that can be referenced within a SELECT, INSERT, UPDATE, or DELETE statement. It is different from a Stored Procedure as it is only available for the duration of the query.
CTE stands for Common Table Expression and is defined using the WITH keyword.
CTEs are mainly used for recursive queries, complex joins, and simplifying complex queries.
CTEs are not stored in the database like Stored Procedures; they exist only for the duration of the query execution, whereas Stored Procedures are persisted, reusable objects that can accept parameters.
Q64. what if you have to find out second highest transacting member in each city?
Use SQL query with window function to rank members by transaction amount in each city.
Use SQL query with PARTITION BY clause to group members by city
Use ORDER BY clause to rank members by transaction amount
Select the second highest member for each city
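A PySpark sketch (the DataFrame transactions and the columns city, member_id, and txn_amount are assumptions):

from pyspark.sql import functions as F, Window

totals = transactions.groupBy("city", "member_id").agg(F.sum("txn_amount").alias("total"))

w = Window.partitionBy("city").orderBy(F.desc("total"))
second = totals.withColumn("rnk", F.dense_rank().over(w)).filter(F.col("rnk") == 2)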
Q65. Python sort element of string array by alphabet weightage and resolve all 23 test cases that nobody can
Sort string array elements by alphabet weightage in Python and pass 23 test cases.
Use the sorted() function with key parameter to sort elements by weightage
Define a function to calculate weightage of each character
Test the function with various test cases to ensure accuracy
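One possible sketch, assuming "weightage" means the sum of each letter's position in the alphabet (a=1 ... z=26), with ties broken alphabetically; the interpretation and sample words are assumptions:

def word_weight(word: str) -> int:
    return sum(ord(ch) - ord('a') + 1 for ch in word.lower() if ch.isalpha())

words = ["bad", "ace", "cab"]
print(sorted(words, key=lambda w: (word_weight(w), w)))  # ['cab', 'bad', 'ace']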
Q66. RDDs vs DataFrames. Which is better and why
DataFrames are better than RDDs due to their optimized performance and ease of use.
DataFrames are optimized for better performance than RDDs.
DataFrames have a schema, making it easier to work with structured data.
DataFrames support SQL queries and can be used with Spark SQL.
RDDs are more low-level and require more manual optimization.
RDDs are useful for unstructured data or when fine-grained control is needed.
Q67. Connecting Spark to Azure SQL Database.
Spark can connect to Azure SQL Database using JDBC driver.
Download and install the JDBC driver for Azure SQL Database.
Set up the connection string with the appropriate credentials.
Use the JDBC API to connect Spark to Azure SQL Database.
Example: val df = spark.read.jdbc(jdbcUrl, tableName, connectionProperties)
Ensure that the firewall rules for the Azure SQL Database allow access from the Spark cluster.
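A PySpark equivalent of the example above (server, database, table, and credentials are placeholders; assumes the SQL Server JDBC driver is on the classpath):

jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb"
props = {
    "user": "my_user",
    "password": "my_password",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}

df = spark.read.jdbc(url=jdbc_url, table="dbo.sales", properties=props)
df.write.jdbc(url=jdbc_url, table="dbo.sales_copy", mode="append", properties=props)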
Q68. How to delete duplicate rows from a table
To delete duplicate rows, first identify the duplicates (for example with GROUP BY or a ROW_NUMBER() window) and then remove all but one copy with a DELETE statement.
Note that SELECT DISTINCT or GROUP BY only returns unique rows; on its own it does not delete anything from the table.
Use a DELETE with a subquery (or a ROW_NUMBER() CTE) that keeps one row per duplicate group and removes the rest.
Alternatively, create a new table from the de-duplicated SELECT, drop the old table, and rename the new one; a runnable sketch follows below.
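A self-contained sketch using SQLite's implicit rowid to keep one copy per duplicate group (table and column names are illustrative; other databases would typically use a ROW_NUMBER() CTE instead):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("a", 10), ("a", 10), ("b", 20)])

conn.execute("""
    DELETE FROM orders
    WHERE rowid NOT IN (
        SELECT MIN(rowid) FROM orders GROUP BY customer, amount
    )
""")
print(conn.execute("SELECT * FROM orders").fetchall())  # [('a', 10), ('b', 20)]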
Q69. About ETL - What do you know about it and what are fundamental factors to be considered while working on any ETL tool.
ETL stands for Extract, Transform, Load. It is a process of extracting data from various sources, transforming it, and loading it into a target system.
ETL is used to integrate data from different sources into a unified format.
The fundamental factors to consider while working on any ETL tool include data extraction, data transformation, and data loading.
Data extraction involves retrieving data from various sources such as databases, files, APIs, etc.
Data transformation involves cleaning, deduplicating, and converting the data into the format required by the target system, and data loading writes the result into the target system.
Q70. What are all the issues you faced in your project? What is Global Parameter? Why do we need parameters inADF? What are the API's in Spark?
Common project issues, global parameters in ADF, and the main Spark APIs
Issues faced in project: data quality, scalability, performance, integration
Global parameter: a parameter that can be accessed across multiple components in a system
Parameters in ADF: used to pass values between activities in a pipeline
APIs in Spark: Spark SQL, Spark Streaming, MLlib, GraphX
Q71. 2) Difference between partitioning and Bucketing
Partitioning divides data into separate directories based on a column's value; bucketing divides data into a fixed number of buckets based on a hash of a column.
Partitioning organizes data so that queries can skip irrelevant partitions (partition pruning).
Bucketing spreads rows across a fixed number of files, which speeds up joins and sampling on the bucket column.
Partitioning is done on a low-cardinality column value, such as date or region.
Bucketing is done by hashing the bucket column (Hive and Spark use their own hash functions, not cryptographic ones like MD5) modulo the number of buckets.
Partitioning can improve query performance significantly when filters match the partition column; a write example for both follows below.
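A write-side sketch of both (paths, table name, and column names are illustrative; assumes an existing DataFrame df):

# Partitioning: one directory per distinct value of the partition column.
df.write.partitionBy("order_date").mode("overwrite").parquet("/data/orders_partitioned")

# Bucketing: rows are hashed on the bucket column into a fixed number of files;
# bucketed writes must go through saveAsTable.
df.write.bucketBy(8, "customer_id").sortBy("customer_id") \
    .mode("overwrite").saveAsTable("orders_bucketed")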
Q72. Tell me about a data engineering challenge you faced. How did you tackle it and what was the outcome?
Migrating data from on-premise servers to cloud storage
Identified data sources and destination in cloud storage
Developed ETL pipelines to extract, transform, and load data
Ensured data integrity and security during migration process
Monitored and optimized performance of data transfer
Collaborated with cross-functional teams for successful migration
Q73. What are the window functions you have used?
Window functions are used to perform calculations across a set of rows that are related to the current row.
Commonly used window functions include ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, FIRST_VALUE, LAST_VALUE, and NTILE.
Window functions are used in conjunction with the OVER clause to define the window or set of rows to perform the calculation on.
Window functions can be used to calculate running totals, moving averages, and other aggregate calculations.
Window functions are supported in most relational databases as well as in Spark SQL and Hive.
Q74. Difference between the interactive cluster and job cluster ?
Interactive clusters allow for real-time interaction and exploration, while job clusters are used for running batch jobs.
Interactive clusters are used for real-time data exploration and analysis.
Job clusters are used for running batch jobs and processing large amounts of data.
Interactive clusters are typically smaller in size and have shorter lifespans.
Job clusters are usually larger and more powerful to handle heavy workloads.
Examples: interactive clusters are used for ad-hoc analysis and notebook development, while job clusters run scheduled production jobs and terminate when the job finishes.
Q75. Which techniques you would have used on a system you have worked in the past to be able to optimise it further. (Deep knowledge of the databases is required)
I would have used indexing, query optimization, and data partitioning to optimize the system.
Implement indexing on frequently queried columns to improve search performance.
Optimize queries by using proper joins, filters, and aggregations.
Partition large tables to distribute data across multiple storage devices for faster access.
Use materialized views to precompute and store aggregated data for quicker retrieval.
Q76. How do you handle null values in PySpark?
Null values in PySpark are handled using functions such as dropna(), fillna(), and replace().
dropna() function is used to drop rows or columns with null values
fillna() function is used to fill null values with a specified value or method
replace() function is used to swap placeholder values (such as 'N/A') for other values, including null; nulls themselves are filled with fillna()
coalesce() function is used to replace null values with the first non-null value in a list of columns
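A short sketch of the calls above (the DataFrame df and its columns are illustrative):

from pyspark.sql import functions as F

df.dropna(subset=["email"])                            # drop rows where email is null
df.fillna({"age": 0, "country": "unknown"})            # per-column default values
df.replace("N/A", None, subset=["country"])            # normalise placeholder strings to null
df.withColumn("phone", F.coalesce("phone", "mobile"))  # first non-null of two columns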
Q77. How would you monitor an overnight data load job in snowflake
Monitor overnight data load job in Snowflake
Set up alerts and notifications for job completion or failure
Check job logs for any errors or issues
Monitor resource usage during the data load process
Use Snowflake's query history to track job progress
Implement automated retries in case of failures
Q78. how to achieve aws cross account sharing?
AWS cross account sharing can be achieved by using IAM roles and policies.
Create an IAM role in the account that will share resources
Define a trust policy in the role to allow the other account to assume the role
Attach a policy to the role granting the necessary permissions
In the receiving account, create an IAM role with a trust policy allowing the sharing account to assume the role
Use the AWS CLI or console to assume the role and access the shared resources
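A boto3 sketch of assuming the cross-account role (account ID, role name, and bucket are placeholders):

import boto3

sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn="arn:aws:iam::222222222222:role/SharedDataAccess",
    RoleSessionName="cross-account-read",
)["Credentials"]

# Use the temporary credentials to access resources in the other account.
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
print(s3.list_objects_v2(Bucket="shared-bucket-in-account-b").get("KeyCount"))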
Q79. 1) What is internal mechanism in spark . 2) tungsten project in spark explanation 3) sql problem to check where last two transaction belongs to particular retail
Questions related to Spark internals, Tungsten project, and SQL problem for retail transactions.
Spark's internal mechanism: the driver turns each job into a DAG of stages and tasks, which the scheduler distributes across executors; the stack includes Spark Core, Spark SQL, Spark Streaming, and MLlib.
The Tungsten project in Spark aims to improve performance by optimizing memory usage (off-heap, binary row format) and CPU utilization (whole-stage code generation).
To solve the SQL problem, filter the transactions for the particular retailer, sort them by date and time in descending order, and take the top two rows, for example with a ROW_NUMBER() window function or LIMIT 2.
Q80. What is a view in SQL and what is dense rank?
A view is a virtual table created from a SQL query. DENSE_RANK assigns ranks to rows in a result set with no gaps between rank values.
A view is a saved SQL query that can be queried like a table
DENSE_RANK gives tied rows the same rank and, unlike RANK, does not skip rank values after a tie
DENSE_RANK orders rows based on a specific column or set of columns given in the OVER clause
Example: SELECT * FROM my_view WHERE column_name = 'value'
Example: SELECT column_name, DENSE_RANK() OVER (ORDER BY column_name) FROM my_table
Q81. How will you Join if two tables are large in pyspark?
When both tables are genuinely large, use a shuffle-based (sort-merge) join on a well-distributed join key; broadcast joins only help when one side is small enough to fit in executor memory.
Broadcast join - broadcast the smaller table to all worker nodes, avoiding a shuffle of the large table.
Sort-merge (shuffle) join - repartition both tables on the join key so matching keys land in the same partitions, then join.
Bucketing both tables on the join key at write time avoids the shuffle for repeated joins.
Example (one side small): df1.join(broadcast(df2), 'join_key')
Example (both large): df1.repartition('join_key').join(df2.repartition('join_key'), 'join_key')
Q82. Static allocation in spark. 10TB of file needs to be processed in spark, what configuration (executors and cores) would you choose and why?
For processing 10TB of file in Spark, consider allocating multiple executors with sufficient cores to maximize parallel processing.
Allocate multiple executors to handle the large file size efficiently
Determine the optimal number of cores per executor based on the available resources and workload
Consider the memory requirements for each executor to avoid out-of-memory errors
Adjust the configuration based on the specific requirements of the job and cluster setup
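An illustrative configuration only; the exact numbers are assumptions that depend on the cluster. For example, 50 executors with 5 cores and about 20 GB each give 250 parallel tasks, and 10 TB at 128 MB per input split is roughly 80,000 tasks processed in waves:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("large-file-job")
    .config("spark.executor.instances", "50")
    .config("spark.executor.cores", "5")
    .config("spark.executor.memory", "20g")
    .config("spark.executor.memoryOverhead", "4g")
    .config("spark.sql.shuffle.partitions", "2000")
    .getOrCreate()
)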
Q83. What Volume of data have you handled in your POCs ?
I have handled terabytes of data in my POCs, including data from various sources and formats.
Handled terabytes of data in POCs
Worked with data from various sources and formats
Used tools like Hadoop, Spark, and SQL for data processing
Q84. What are the different AWS Data Analytics services used in your project and explanation for why each service was used? What are the alternate services available and why they were not used in the project? Questi...
AWS Data Analytics services used in the project, alternate services considered, and RDBMS concepts
AWS Data Analytics services used: Amazon Redshift for data warehousing, Amazon EMR for big data processing, Amazon Athena for interactive querying
Explanation for usage: Redshift for storing and analyzing large datasets, EMR for processing and analyzing big data, Athena for ad-hoc querying
Alternate services not used: Amazon RDS for relational database management and Amazon Kinesis for real-time data streaming
Q85. Design a business case that uses a self join. Condition: do not use a hierarchical use case such as teacher/student, employee/manager, or father/grandfather.
Using self join to analyze customer behavior in an e-commerce platform.
Identifying patterns in customer purchase history
Analyzing customer preferences based on past purchases
Segmenting customers based on their buying behavior
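A sketch of a non-hierarchical self join: pairs of products bought by the same customer ("frequently bought together"); the orders table and its columns are assumptions:

query = """
SELECT a.product_id AS product_a,
       b.product_id AS product_b,
       COUNT(DISTINCT a.customer_id) AS shared_buyers
FROM orders a
JOIN orders b
  ON a.customer_id = b.customer_id
 AND a.product_id < b.product_id
GROUP BY a.product_id, b.product_id
ORDER BY shared_buyers DESC
"""
frequently_bought_together = spark.sql(query)  # assumes 'orders' is registered as a table or view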
Q86. SQL question to return the shortest duration of flight to travel from NY to HND
Use SQL query to find shortest flight duration from NY to HND
Use SQL query with MIN function to find shortest duration
Filter flights from NY to HND using WHERE clause
Calculate the duration by subtracting the departure time from the arrival time; a sketch of the query follows below
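A sketch of the query (the flights table and its columns are assumptions; timestamps are assumed to be in the same time zone):

query = """
SELECT MIN((unix_timestamp(arrival_time) - unix_timestamp(departure_time)) / 60)
       AS shortest_duration_minutes
FROM flights
WHERE origin = 'NY' AND destination = 'HND'
"""
shortest = spark.sql(query)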
Q87. What is normalization in SQL? Explain 1NF, 2NF, and 3NF.
Normalization in SQL is the process of organizing data in a database to reduce redundancy and improve data integrity.
1NF (First Normal Form) - Each column in a table must contain atomic values, and there should be no repeating groups.
2NF (Second Normal Form) - Table should be in 1NF and all non-key attributes must be fully functionally dependent on the entire primary key (no partial dependencies).
3NF (Third Normal Form) - Table should be in 2NF and there should be no transitive dependencies, i.e., non-key attributes must not depend on other non-key attributes.
Q88. There are four cores and four worker nodes in Spark. How many jobs will run in parallel?
By default a Spark application runs one job at a time; parallelism within that job comes from running tasks on the available cores.
Each core runs one task at a time, so with four cores in total only four tasks run concurrently; if each of the four worker nodes has four cores, up to 16 tasks can run in parallel.
Jobs run sequentially unless they are submitted from separate threads (optionally with the fair scheduler enabled).
So the usual answer is one job at a time, with the number of parallel tasks bounded by the total core count.
Q89. What are the optimisation techniques you have used in your project ?
I have used techniques like indexing, query optimization, and parallel processing in my projects.
Indexing: Used to improve the speed of data retrieval by creating indexes on columns frequently used in queries.
Query optimization: Rewriting queries to improve efficiency and reduce execution time.
Parallel processing: Distributing tasks across multiple processors to speed up data processing.
Caching: Storing frequently accessed data in memory to reduce the need for repeated retrieval from disk or recomputation.
Q90. Write function to check if number is an Armstrong Number
Function to check if a number is an Armstrong Number
An Armstrong Number is a number that is equal to the sum of its own digits raised to the power of the number of digits
To check if a number is an Armstrong Number, we need to calculate the sum of each digit raised to the power of the number of digits
If the sum is equal to the original number, then it is an Armstrong Number
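A short sketch of the function:

def is_armstrong(n: int) -> bool:
    # Sum of each digit raised to the power of the digit count equals the number itself.
    digits = str(n)
    power = len(digits)
    return n == sum(int(d) ** power for d in digits)

print(is_armstrong(153))   # True: 1^3 + 5^3 + 3^3 = 153
print(is_armstrong(9474))  # True: 9^4 + 4^4 + 7^4 + 4^4 = 9474
print(is_armstrong(100))   # False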
Q91. what are the tools I used for the data engineering ?
Tools used for data engineering include ETL tools, programming languages, databases, and cloud platforms.
ETL tools like Apache NiFi, Talend, and Informatica are used for data extraction, transformation, and loading.
Programming languages like Python, Java, and Scala are used for data processing and analysis.
Databases like MySQL, PostgreSQL, and MongoDB are used for storing and managing data.
Cloud platforms like AWS, Azure, and Google Cloud provide scalable infrastructure for data storage, processing, and orchestration.
Q92. 3) Difference between cache and persistent storage
Cache is temporary storage used to speed up access to frequently accessed data. Persistent storage is permanent storage used to store data even after power loss.
Cache is faster but smaller than persistent storage
Cache is volatile and data is lost when power is lost
Persistent storage is non-volatile and data is retained even after power loss
Examples of cache include CPU cache, browser cache, and CDN cache
Examples of persistent storage include hard disk drives, solid-state drives, and cloud object storage.
Q93. How can you design an Azure Data Factory pipeline to copy data from a folder containing files with different delimiters to another folder?
Design an Azure Data Factory pipeline to copy data with different delimiters.
Use a Copy Data activity in Azure Data Factory to copy data from source folder to destination folder.
Create a dataset for the source folder with multiple file formats to handle different delimiters.
Use a mapping data flow to transform the data if needed before copying to the destination folder.
Q94. What is IR - integration Runtime? what are the types of IR
Integration Runtime (IR) is a compute infrastructure that provides data integration capabilities across different network environments.
IR is used in Azure Data Factory to provide data integration capabilities
There are three types of IR: Azure, Self-hosted, and Azure-SSIS
Azure IR is fully managed by Microsoft and is used for data movement in the cloud
Self-hosted IR allows data movement between on-premises and cloud data stores
Azure-SSIS IR is used for running SQL Server Integration Services (SSIS) packages in Azure Data Factory.
Q95. What is the difference between repartition and coalesce? What is the difference between persist and cache?
repartition vs coalesce, persist vs cache
repartition is used to increase or decrease the number of partitions in a DataFrame, while coalesce is used to decrease the number of partitions without shuffling
persist is used to keep the DataFrame in memory and/or on disk for faster reuse, while cache is shorthand for persist with the default storage level (MEMORY_AND_DISK for DataFrames, MEMORY_ONLY for RDDs)
repartition example: df.repartition(10)
coalesce example: df.coalesce(5)
persist example: df.persist()
cache example: df.cache()
Q96. What is imputer function in PySpark
Imputer function in PySpark is used to replace missing values in a DataFrame.
Imputer is a transformer in PySpark ML library.
It replaces missing values in a DataFrame with either mean, median, or mode of the column.
It works only on numeric columns; categorical columns need a different approach, such as fillna() with a default category.
Example: imputer = Imputer(inputCols=['col1', 'col2'], outputCols=['col1_imputed', 'col2_imputed'], strategy='mean')
Example: imputed_df = imputer.fit(df).transform(df)
Q97. 1) Optimizations techniques used while working on Spark and hive. 2) difference between partitioning and bucketing 3) How to add column in Data frame 4) Difference between cache and persistent
Answers to questions related to Spark, Hive, and Data Frames
Optimization techniques in Spark and Hive include partitioning, bucketing, and caching
Partitioning divides data into smaller, more manageable chunks while bucketing groups data based on a specific column
Adding a column to a Data Frame can be done using the 'withColumn' method
cache() stores data using the default storage level for faster access, while persist() lets you choose the storage level, including disk-backed options for durability under memory pressure
Q98. How to handle duplicates in python ?
Use Python's built-in data structures like sets or dictionaries to handle duplicates.
Use a set to remove duplicates from a list: unique_list = list(set(original_list))
Use a dictionary to remove duplicates from a list while preserving order: unique_list = list(dict.fromkeys(original_list))
Q99. What are action and transformation ?
Actions and transformations are key concepts in data engineering, involving the manipulation and processing of data.
Actions are operations that trigger the execution of a data transformation job in a distributed computing environment.
Transformations are functions that take an input dataset and produce an output dataset, often involving filtering, aggregating, or joining data.
Examples of actions include 'saveAsTextFile' in Apache Spark, which saves the RDD to a text file, and 'collect', which returns the results to the driver.
Q100. What are the concepts of coalesce and repartition in data processing?
Coalesce and repartition are concepts used in data processing to control the number of partitions in a dataset.
Coalesce is used to reduce the number of partitions in a dataset without shuffling the data, which can improve performance.
Repartition is used to increase or decrease the number of partitions in a dataset by shuffling the data across the cluster.
Coalesce is preferred over repartition when reducing partitions to avoid unnecessary shuffling of data.
Repartition is useful when increasing parallelism or when redistributing skewed data evenly across partitions.