Data Engineer

1000+ Data Engineer Interview Questions and Answers

Updated 28 Mar 2025

Q101. Write a Python program for list/dictionary comprehensions

Ans.

Python program for list/dictionary comprehensions

  • List comprehension: [expression for item in iterable]

  • Dictionary comprehension: {key_expression: value_expression for item in iterable}

  • Example: squares = [x**2 for x in range(10)]

  • Example: dict_squares = {x: x**2 for x in range(10)}
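
For instance, a short runnable sketch combining both forms:

```python
# List comprehension: squares of 0..9.
squares = [x**2 for x in range(10)]            # [0, 1, 4, ..., 81]

# Dictionary comprehension: map each number to its square.
dict_squares = {x: x**2 for x in range(10)}    # {0: 0, 1: 1, ..., 9: 81}

# A filter clause can follow the loop: keep only even numbers.
even_squares = [x**2 for x in range(10) if x % 2 == 0]

print(squares)
print(dict_squares)
print(even_squares)
```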

Q102. Rate yourself out of 5 in PySpark, Python, and SQL

Ans.

I would rate myself 4 in Pyspark, 5 in Python, and 4 in SQL.

  • Strong proficiency in Python programming language

  • Experience in working with Pyspark for big data processing

  • Proficient in writing complex SQL queries for data manipulation

  • Familiarity with optimizing queries for performance

  • Hands-on experience in data engineering projects

Q103. Write a SQL query to get student names who got marks > 45 in each subject from the Student table

Ans.

SQL query to retrieve student names with marks > 45 in each subject

  • Use GROUP BY and HAVING clauses to filter students with marks > 45 in each subject

  • Join Student table with Marks table on student_id to get marks for each student

  • Select student names from Student table based on the conditions
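
A minimal sketch of the query, assuming hypothetical tables Student(student_id, name) and Marks(student_id, subject, marks); "marks > 45 in each subject" translates to MIN(marks) > 45 after grouping:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical schema: Student(student_id, name), Marks(student_id, subject, marks).
# A student qualifies only if every subject row has marks > 45, i.e. MIN(marks) > 45.
result = spark.sql("""
    SELECT s.name
    FROM Student s
    JOIN Marks m ON s.student_id = m.student_id
    GROUP BY s.student_id, s.name
    HAVING MIN(m.marks) > 45
""")
result.show()
```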

Q104. Provide an approach to develop an ETL pipeline where a CSV file is dropped in S3, transformed through Airflow, and loaded into Snowflake.

Ans.

Develop ETL pipeline using Airflow to process CSV files in S3 and load data into Snowflake.

  • Set up an S3 sensor in Airflow to detect when a new CSV file is dropped in the specified bucket.

  • Create a custom Python operator in Airflow to read the CSV file from S3, perform necessary transformations, and load data into Snowflake.

  • Use SnowflakeHook in Airflow to establish connection with Snowflake and execute SQL queries to load data.

  • Schedule the ETL pipeline in Airflow to run at spec...read more
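
One hedged sketch of such a DAG (bucket names, paths, connection IDs, and the COPY statement are placeholders; recent Airflow provider packages for Amazon and Snowflake are assumed):

```python
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook


def transform_and_load(**context):
    # Illustrative only: read the CSV from S3 (requires s3fs), clean it,
    # then load it into Snowflake via SnowflakeHook.
    df = pd.read_csv("s3://my-bucket/incoming/data.csv")    # placeholder path
    df = df.dropna()                                        # example transformation
    hook = SnowflakeHook(snowflake_conn_id="snowflake_default")
    hook.run("COPY INTO target_table FROM @my_stage")       # placeholder load step


with DAG("s3_csv_to_snowflake", start_date=datetime(2024, 1, 1),
         schedule_interval="@hourly", catchup=False) as dag:
    wait_for_csv = S3KeySensor(
        task_id="wait_for_csv",
        bucket_name="my-bucket",
        bucket_key="incoming/*.csv",
        wildcard_match=True,
    )
    load = PythonOperator(task_id="transform_and_load",
                          python_callable=transform_and_load)
    wait_for_csv >> load
```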

Q105. Calculate second highest salary using SQL as well as pyspark.

Ans.

Calculate second highest salary using SQL and pyspark

  • Use SQL query with ORDER BY and LIMIT to get the second highest salary

  • In pyspark, use orderBy() and take() functions to achieve the same result
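
A minimal sketch of both approaches (the employee rows are made up for illustration; the SQL variant uses a MAX subquery rather than a dialect-specific LIMIT/OFFSET):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
emp = spark.createDataFrame([("a", 100), ("b", 300), ("c", 200)], ["name", "salary"])
emp.createOrReplaceTempView("employees")

# SQL: the highest salary below the overall maximum is the second highest.
second_sql = spark.sql("""
    SELECT MAX(salary) AS second_highest
    FROM employees
    WHERE salary < (SELECT MAX(salary) FROM employees)
""").collect()[0][0]

# PySpark DataFrame API: order descending and take the second distinct value.
second_df = (emp.select("salary").distinct()
                .orderBy(F.col("salary").desc())
                .take(2)[-1][0])

print(second_sql, second_df)   # 200 200
```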

Q106. How do you optimize SQL queries?

Ans.

Optimizing SQL queries involves using indexes, avoiding unnecessary joins, and optimizing the query structure.

  • Use indexes on columns frequently used in WHERE clauses

  • Avoid using SELECT * and only retrieve necessary columns

  • Optimize joins by using INNER JOIN instead of OUTER JOIN when possible

  • Use EXPLAIN to analyze query performance and make necessary adjustments

Q107. Find second max salary of an employee by department using a Python pandas DataFrame

Ans.

Find second max salary of employee by department using pandas dataframe

  • Group the dataframe by department

  • Sort the salaries in descending order

  • Select the second highest salary for each department
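
A small pandas sketch (the sample data is hypothetical):

```python
import pandas as pd

# Hypothetical employee data.
df = pd.DataFrame({
    "department": ["IT", "IT", "IT", "HR", "HR"],
    "employee":   ["a", "b", "c", "d", "e"],
    "salary":     [100, 300, 200, 400, 250],
})

# Rank salaries within each department in descending order and keep rank == 2.
second_max = (
    df.assign(rank=df.groupby("department")["salary"]
                    .rank(method="dense", ascending=False))
      .query("rank == 2")
      .drop(columns="rank")
)
print(second_max)
```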

Q108. What is the difference between UNION and UNION ALL?

Ans.

UNION combines and removes duplicates, UNION ALL combines without removing duplicates.

  • UNION merges the results of two or more SELECT statements into a single result set.

  • UNION ALL does the same as UNION, but it does not remove duplicates.

  • UNION is more resource-intensive than UNION ALL because it performs a sort operation.

  • UNION ALL is faster than UNION when all records are required.

  • Example: SELECT column1 FROM table1 UNION SELECT column1 FROM table2;

  • Example: SELECT column1 FROM...read more

Q109. 3 SQL queries; architecture for a data pipeline to display hours per activity in the last 24 hrs on a smartwatch.

Ans.

Design a data pipeline to display hourly activity on a smartwatch for the last 24 hours using SQL queries.

  • Create a table to store activity data with columns for timestamp and activity type.

  • Use a SQL query to aggregate activity data by hour for the last 24 hours.

  • Display the results on the smartwatch using a suitable visualization.
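
A sketch of the hourly aggregation (the activity_events table and its columns are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical table: activity_events(user_id, activity_type, event_ts).
hourly = spark.sql("""
    SELECT date_trunc('hour', event_ts) AS activity_hour,
           activity_type,
           COUNT(*)                     AS events
    FROM activity_events
    WHERE event_ts >= current_timestamp() - INTERVAL 24 HOURS
    GROUP BY date_trunc('hour', event_ts), activity_type
    ORDER BY activity_hour
""")
hourly.show()
```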

Q110. How to deal with data quality issues

Ans.

Data quality issues can be dealt with by identifying the root cause, implementing data validation checks, and establishing data governance policies.

  • Identify the root cause of the data quality issue

  • Implement data validation checks to prevent future issues

  • Establish data governance policies to ensure data accuracy and consistency

  • Regularly monitor and audit data quality

  • Involve stakeholders in the data quality process

  • Use data profiling and cleansing tools

  • Ensure data security and p...read more

Q111. What is ETL, and can you provide examples from your project to illustrate its application?

Ans.

ETL stands for Extract, Transform, Load. It is a process used to extract data from various sources, transform it into a consistent format, and load it into a target database.

  • Extract: Retrieving data from different sources such as databases, APIs, or files.

  • Transform: Cleaning, filtering, and structuring the extracted data to fit the target database schema.

  • Load: Loading the transformed data into the target database for analysis or reporting.

  • Example: Extracting customer informat...read more

Q112. What is the difference between RDD (Resilient Distributed Datasets) and DataFrame in Apache Spark?

Ans.

RDD is a low-level abstraction representing a distributed collection of objects, while DataFrame is a higher-level abstraction representing a distributed collection of data organized into named columns.

  • RDD is more suitable for unstructured data and low-level transformations, while DataFrame is more suitable for structured data and high-level abstractions.

  • DataFrames provide optimizations like query optimization and code generation, while RDDs do not.

  • DataFrames support SQL quer...read more

Q113. What is the difference between the reduceBy and groupBy transformations in Apache Spark?

Ans.

reduceBy is used to aggregate data based on key, while groupBy is used to group data based on key.

  • reduceBy (reduceByKey in Spark) is a transformation that combines the values of each key using an associative and commutative function.

  • groupBy is a transformation that groups the data based on a key and returns a grouped data set.

  • reduceBy is more efficient for aggregating data as it reduces the data before shuffling, while groupBy shuffles all the data before grouping.

  • reduceBy is typically u...read more

Q114. ETL - How to do full load in SSIS, mention the steps

Ans.

To perform a full load in SSIS, you can use the Data Flow Task with a source and destination component.

  • Create a Data Flow Task in the Control Flow tab of the SSIS package.

  • Add a source component to extract data from the source system.

  • Add a destination component to load data into the destination system.

  • Map the columns from the source to the destination.

  • Run the package to execute the full load.

Q115. ETL - How to do the incremental load in ADF and in SSIS

Ans.

Incremental load in ADF and SSIS involves identifying new or updated data and loading only those changes.

  • In ADF, use watermark columns to track the last loaded value and filter data based on this value

  • In SSIS, use CDC (Change Data Capture) components or custom scripts to identify new or updated data

  • Both ADF and SSIS support incremental loading by comparing source and target data to determine changes

Q116. How to copy data without using multiple activities, dynamically using loops/parameterization?

Ans.

Use a single activity with dynamic parameterization and loops to copy data.

  • Use a loop to iterate through the data source and destination locations.

  • Parameterize the source and destination locations to dynamically copy data.

  • Utilize a scripting language like Python or PowerShell to implement the logic.

  • Example: Use a Python script with a loop to copy files from one folder to another.

  • Example: Use PowerShell script with dynamic parameters to copy data from one database to another.

Q117. How to insert non-duplicate data into a target table? In how many ways can this be done?

Ans.

To insert non-duplicate data into a target table, you can use methods like using a unique constraint, using a merge statement, or using a temporary table.

  • Use a unique constraint on the target table to prevent duplicate entries.

  • Use a merge statement to insert data into the target table only if it does not already exist.

  • Use a temporary table to store the new data, then insert only the non-duplicate records into the target table.
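
As one sketch, the merge approach on a Delta table (table and column names are placeholders; MERGE INTO requires a table format that supports it, such as Delta Lake):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Insert rows from a staging table only when no row with the same id
# already exists in the target table.
spark.sql("""
    MERGE INTO target_table AS t
    USING staging_table AS s
    ON t.id = s.id
    WHEN NOT MATCHED THEN INSERT *
""")
```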

Q118. Have you worked on Lambda functions? Explain.

Ans.

Lambda function is a serverless computing service that runs code in response to events and automatically manages the computing resources required.

  • Lambda functions are event-driven and can be triggered by various AWS services such as S3, DynamoDB, API Gateway, etc.

  • They are written in languages like Python, Node.js, Java, etc.

  • Lambda functions are scalable and cost-effective as you only pay for the compute time you consume.

  • They can be used for data processing, real-time file pro...read more

Q119. Explain databricks dlt, and when will you use batch vs streaming?

Ans.

Databricks DLT (Delta Live Tables) is a declarative framework for building reliable batch and streaming data pipelines.

  • Databricks DLT (Delta Live Tables) runs on top of Delta Lake, the storage layer that brings ACID transactions to Apache Spark and big data workloads.

  • Batch processing is used when data is collected over a period of time and processed in large chunks, while streaming processing is used for real-time data processing.

  • Use batch processing for historical data analysis, ETL jobs, and periodic reporting. Use streaming proce...read more

Q120. How to split staged data’s row into separate columns

Ans.

Use SQL functions like SUBSTRING and CHARINDEX to split staged data's row into separate columns

  • Use SUBSTRING function to extract specific parts of the row

  • Use CHARINDEX function to find the position of a specific character in the row

  • Use CASE statements to create separate columns based on conditions

Q121. What is the difference between repartition and coalesce?

Ans.

Repartition increases the number of partitions in a DataFrame, while coalesce reduces the number of partitions without shuffling data.

  • Repartition involves a full shuffle of the data across the cluster, which can be expensive.

  • Coalesce minimizes data movement by only creating new partitions if necessary.

  • Repartition is typically used when increasing parallelism or evenly distributing data, while coalesce is used for reducing the number of partitions without a full shuffle.

  • Exampl...read more
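
For instance, a small PySpark sketch showing both calls:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)          # toy DataFrame

print(df.rdd.getNumPartitions())     # whatever the default parallelism gives

# repartition: full shuffle; can increase or decrease the partition count.
df_repart = df.repartition(200)

# coalesce: merges existing partitions without a full shuffle; only decreases.
df_small = df_repart.coalesce(10)

print(df_repart.rdd.getNumPartitions(), df_small.rdd.getNumPartitions())   # 200 10
```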

Q122. How to connect SQL Server to Databricks

Ans.

To connect SQL server to Databricks, use JDBC/ODBC drivers and configure the connection settings.

  • Install the appropriate JDBC/ODBC driver for SQL server

  • Configure the connection settings in Databricks

  • Use the JDBC/ODBC driver to establish the connection
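
A hedged sketch of the JDBC read inside a Databricks notebook (server, database, table, and credentials are placeholders; the notebook's built-in `spark` session is assumed):

```python
# JDBC URL for SQL Server (placeholder host and database).
jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb"

df = (spark.read.format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", "dbo.my_table")
      .option("user", "my_user")
      .option("password", "my_password")
      .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
      .load())

df.show(5)
```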

Q123. Difference between view and indexed view, pipeline design process, where to monitor pipeline failures and how to fix them

Ans.

View vs indexed view, pipeline design process, monitoring pipeline failures and fixing

  • View is a virtual table based on a SELECT query, while indexed view is a view with a clustered index for faster retrieval

  • Pipeline design process involves defining data sources, transformations, and destinations

  • Monitor pipeline failures using the monitoring and logging features of orchestration tools like Apache NiFi or Airflow

  • Fix pipeline failures by identifying the root cause, adjusting configurations, or updating dependencies

Q124. What are types of structured query languages

Ans.

Types of structured query languages include SQL, PL/SQL, T-SQL, and others.

  • SQL (Structured Query Language) - widely used for managing relational databases

  • PL/SQL (Procedural Language/SQL) - Oracle's proprietary extension for SQL

  • T-SQL (Transact-SQL) - Microsoft's extension for SQL used in SQL Server

  • Others - dialects used by database systems such as MySQL, PostgreSQL, SQLite, etc.

Q125. Explain briefly how GCP is highly scalable and manages data.

Ans.

GCP offers scalable data solutions through managed services, allowing seamless data storage, processing, and analytics.

  • Auto-scaling: GCP services like Google Kubernetes Engine automatically adjust resources based on demand.

  • BigQuery: A serverless data warehouse that scales to handle petabytes of data without manual intervention.

  • Cloud Storage: Provides scalable object storage with high availability and durability, suitable for unstructured data.

  • Dataflow: A fully managed service...read more

Q126. What are some SQL queries that utilize joins and window functions?

Ans.

SQL queries using joins and window functions

  • Use INNER JOIN to combine rows from two or more tables based on a related column

  • Use WINDOW functions like ROW_NUMBER() to assign a unique sequential integer to each row within a partition

  • Example: SELECT column1, column2, ROW_NUMBER() OVER(PARTITION BY column1 ORDER BY column2) AS row_num FROM table_name

Q127. Are you familiar with Celebal Technologies

Ans.

Celebal Technologies is a technology company specializing in data engineering and analytics solutions.

  • Celebal Technologies is known for providing data engineering and analytics solutions.

  • They offer services such as data integration, data warehousing, and data visualization.

  • Celebal Technologies works with clients across various industries to help them optimize their data processes.

  • They have expertise in technologies like Hadoop, Spark, and Python for data engineering.

  • The compa...read more

Q128. What are the technologies you have worked on?

Ans.

I have worked on various technologies including Hadoop, Spark, SQL, Python, and AWS.

  • Experience with Hadoop and Spark for big data processing

  • Proficient in SQL for data querying and manipulation

  • Skilled in Python for data analysis and scripting

  • Familiarity with AWS services such as S3, EC2, and EMR

  • Knowledge of data warehousing and ETL processes

Q129. How to remove duplicates in a DataFrame using PySpark?

Ans.

Use dropDuplicates() function in pyspark to remove duplicates in a data frame.

  • Use dropDuplicates() function on the data frame to remove duplicates based on all columns.

  • Specify subset of columns to remove duplicates based on specific columns.

  • Use the distinct() function to remove duplicates and keep only distinct rows.
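
A short sketch of all three options on a toy DataFrame:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (1, "a"), (1, "b"), (2, "c")], ["id", "value"])

dedup_all   = df.dropDuplicates()          # duplicates judged across all columns
dedup_by_id = df.dropDuplicates(["id"])    # duplicates judged on the id column only
dedup_dist  = df.distinct()                # same effect as dropDuplicates() on all columns

dedup_all.show(); dedup_by_id.show(); dedup_dist.show()
```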

Q130. How do you set up an alerting mechanism in ADF for failed pipelines?

Ans.

Alerting mechanism in ADF for failed pipelines involves setting up alerts in Azure Monitor and configuring email notifications.

  • Set up alerts in Azure Monitor for monitoring pipeline runs

  • Configure alert rules to trigger notifications when a pipeline run fails

  • Use Azure Logic Apps to send email notifications for failed pipeline runs

Q131. What is the difference between a primary key and a unique key in a DBMS?

Ans.

Primary key uniquely identifies a record in a table, while unique key ensures all values in a column are distinct.

  • A primary key does not allow NULL values, while a unique key allows NULL values (only one in SQL Server, multiple in databases such as PostgreSQL and MySQL).

  • A table can have only one primary key, but multiple unique keys.

  • Primary key is a combination of unique and not null constraints.

  • Primary key is used to establish relationships between tables, while unique key is used to enforce uniqueness in a column.

Q132. What operator is used in Composer to move data from GCS to BQ?

Ans.

The operator used in Composer to move data from GCS to BigQuery is the GCSToBigQueryOperator.

  • The GCS to BigQuery operator is used in Apache Airflow, which is the underlying technology of Composer.

  • This operator allows you to transfer data from Google Cloud Storage (GCS) to BigQuery.

  • You can specify the source and destination parameters in the operator to define the data transfer process.
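
A hedged sketch of the task definition in a Composer DAG (bucket, object, and table names are placeholders; the Airflow Google provider package is assumed):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG("gcs_to_bq_load", start_date=datetime(2024, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    load_csv_to_bq = GCSToBigQueryOperator(
        task_id="load_csv_to_bq",
        bucket="my-bucket",                       # source GCS bucket
        source_objects=["incoming/data.csv"],     # source object(s)
        destination_project_dataset_table="my_project.my_dataset.my_table",
        source_format="CSV",
        write_disposition="WRITE_APPEND",
    )
```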

Q133. How would you build a pipeline to connect to an HTTP source and bring data into ADLS?

Ans.

Build a pipeline to connect http source and bring data in adls

  • Set up a data ingestion tool like Apache NiFi or Azure Data Factory to pull data from the http source

  • Transform the data as needed using tools like Apache Spark or Azure Databricks

  • Store the data in Azure Data Lake Storage (ADLS) for further processing and analysis

Q134. How have you managed experiences involving strict timelines and deliverables?

Ans.

I have successfully managed experiences involving strict timelines and deliverables by prioritizing tasks, setting clear goals, and communicating effectively with team members.

  • Prioritizing tasks based on deadlines and importance

  • Setting clear goals and milestones to track progress

  • Communicating effectively with team members to ensure everyone is on the same page

  • Proactively identifying potential roadblocks and finding solutions to overcome them

Q135. Python vs Java advantages and disadvantages, Kubernetes

Ans.

Python is more flexible and easier to learn, while Java is more performant and better for large-scale projects. Kubernetes is a popular container orchestration tool.

  • Python is more concise and easier to read/write than Java

  • Java is more performant and better for large-scale projects

  • Kubernetes is a popular container orchestration tool used for managing containerized applications

  • Kubernetes provides features like automatic scaling, self-healing, and rolling updates

  • Python is often ...read more

Q136. How will you design/configure a cluster if you are given 10 petabytes of data?

Ans.

Designing/configuring a cluster for 10 petabytes of data involves considerations for storage capacity, processing power, network bandwidth, and fault tolerance.

  • Consider using a distributed file system like HDFS or object storage like Amazon S3 to store and manage the large volume of data.

  • Implement a scalable processing framework like Apache Spark or Hadoop to efficiently process and analyze the data in parallel.

  • Utilize a cluster management system like Apache Mesos or Kubernet...read more

Q137. What is the SQL query to find the third highest salary from a given table?

Ans.

Use SQL query with ORDER BY and LIMIT to find the third highest salary from a table.

  • Use ORDER BY clause to sort salaries in descending order

  • Use LIMIT 1 OFFSET 2 to skip the first two highest salaries

  • Example: SELECT salary FROM employees ORDER BY salary DESC LIMIT 1 OFFSET 2
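
For completeness, a window-function variant that also handles duplicate salary values (the employees table is assumed to be registered):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# DENSE_RANK gives the third-highest distinct salary even when values repeat.
third = spark.sql("""
    SELECT salary
    FROM (
        SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
        FROM employees
    ) ranked
    WHERE rnk = 3
""")
third.show()
```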

Q138. How will you run a child notebook from a parent notebook using the dbutils command?

Ans.

Use dbutils.notebook.run() command to run a child notebook in a parent notebook

  • Use dbutils.notebook.run() command with the path to the child notebook and any parameters needed

  • Ensure that the child notebook is accessible and has necessary permissions

  • Handle any return values or errors from the child notebook appropriately
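
A minimal sketch from the parent notebook (the child notebook path and parameters are placeholders):

```python
# dbutils is available automatically inside a Databricks notebook.
result = dbutils.notebook.run(
    "/Repos/project/child_notebook",   # path to the child notebook
    600,                               # timeout in seconds
    {"run_date": "2024-01-01"},        # parameters exposed to the child as widgets
)

# result holds whatever the child passed to dbutils.notebook.exit(...)
print(result)
```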

Q139. What is the difference between Hadoop and Spark? Difference between coalesce and repartition? SQL query, HDFS

Ans.

Hadoop is a distributed storage and processing framework, while Spark is a fast and general-purpose cluster computing system.

  • Hadoop is primarily used for batch processing of large datasets, while Spark is known for its in-memory processing capabilities.

  • Hadoop uses MapReduce for processing data, while Spark uses Resilient Distributed Datasets (RDDs).

  • Coalesce is used to reduce the number of partitions in a DataFrame or RDD without shuffling data, while repartition is used to in...read more

Q140. What will be the Spark configuration to process 2 GB of data?

Ans.

Set spark configuration with appropriate memory and cores for efficient processing of 2 GB data

  • Increase executor memory and cores to handle larger data size

  • Adjust spark memory overhead to prevent out of memory errors

  • Optimize shuffle partitions for better performance
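
An illustrative (not prescriptive) configuration sketch; for only 2 GB of input, a small executor footprint and fewer shuffle partitions are usually enough:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("process_2gb")
         .config("spark.executor.instances", "2")
         .config("spark.executor.cores", "2")
         .config("spark.executor.memory", "2g")
         .config("spark.executor.memoryOverhead", "512m")
         .config("spark.sql.shuffle.partitions", "16")
         .getOrCreate())
```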

Q141. Assume the DataFrames below: DF1 (UserID, Name) and DF2 (UserID, PageID, Timestamp, Events). Write code to join the DFs, count the number of events, and filter users with 0 events.

Ans.

Join DF's, count events, filter users with 0 events

  • Use join operation to combine DF1 and DF2 on UserID

  • Group by UserID and count the number of events

  • Filter out users with 0 events
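
A sketch of the join-and-count logic (the sample rows are made up to match the given schemas):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(1, "alice"), (2, "bob")], ["UserID", "Name"])
df2 = spark.createDataFrame([(1, 10, "2024-01-01 00:00:00", "click")],
                            ["UserID", "PageID", "Timestamp", "Events"])

# A left join keeps users with no activity; counting a DF2 column yields 0 for them.
event_counts = (df1.join(df2, on="UserID", how="left")
                   .groupBy("UserID", "Name")
                   .agg(F.count(F.col("PageID")).alias("event_count")))

users_with_zero_events = event_counts.filter(F.col("event_count") == 0)
users_with_zero_events.show()   # bob, 0
```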

Q142. What is Re-Partition and Coalesce? How are these used?

Ans.

Re-Partition and Coalesce are methods used to control the number of partitions in a dataset in Apache Spark.

  • Re-Partition is used to increase or decrease the number of partitions in a dataset by shuffling the data across the cluster.

  • Coalesce is used to decrease the number of partitions in a dataset without shuffling the data, which can improve performance.

  • Re-Partition is typically used when there is a need to increase parallelism or balance the data distribution, while Coalesc...read more

Q143. 2 types of modes for Spark architecture?

Ans.

The two types of modes for Spark architecture are standalone mode and cluster mode.

  • Standalone mode: Spark runs on a single machine with a single JVM and is suitable for development and testing.

  • Cluster mode: Spark runs on a cluster of machines managed by a cluster manager like YARN or Mesos for production workloads.

Q144. How would you build a pipeline for a Machine learning project?

Ans.

To build a pipeline for a Machine learning project, you need to collect data, preprocess it, train the model, evaluate its performance, and deploy it.

  • Collect relevant data from various sources

  • Preprocess the data by cleaning, transforming, and normalizing it

  • Split the data into training and testing sets

  • Train the machine learning model using the training data

  • Evaluate the model's performance using the testing data

  • Fine-tune the model if necessary

  • Deploy the model into production en...read more

Q145. What are your experiences with cloud services, particularly with Amazon Web Services (AWS)?

Ans.

I have extensive experience working with AWS, including setting up and managing various cloud services.

  • Implemented data pipelines using AWS Glue for ETL processes

  • Utilized AWS S3 for storing and managing large datasets

  • Deployed and managed EC2 instances for running data processing tasks

  • Used AWS Lambda for serverless computing and automation

  • Worked with AWS Redshift for data warehousing

Q146. What is the difference between UNION and UNION ALL?

Ans.

Union combines and removes duplicates, Union All combines all rows including duplicates.

  • Union merges two tables and removes duplicates

  • Union All merges two tables and includes duplicates

  • Union is slower than Union All as it removes duplicates

  • Syntax: SELECT column1, column2 FROM table1 UNION/UNION ALL SELECT column1, column2 FROM table2

  • Example: SELECT name FROM table1 UNION SELECT name FROM table2

Q147. What are transformations, and how many types of transformations exist?

Ans.

Transformations are operations performed on data to convert it from one form to another. There are mainly two types of transformations: narrow and wide.

  • Transformations are operations performed on data to convert it from one form to another.

  • Narrow transformations are those where each input partition will contribute to only one output partition, e.g., map, filter.

  • Wide transformations are those where each input partition may contribute to multiple output partitions, e.g., groupB...read more

Q148. What are different type of joins available in Databricks?

Ans.

Different types of joins available in Databricks include inner join, outer join, left join, right join, and cross join.

  • Inner join: Returns only the rows that have matching values in both tables.

  • Outer join: Returns all rows when there is a match in either table.

  • Left join: Returns all rows from the left table and the matched rows from the right table.

  • Right join: Returns all rows from the right table and the matched rows from the left table.

  • Cross join: Returns the Cartesian prod...read more

Q149. Write code to print the reverse of a sentence word by word.
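
Ans.

A minimal Python sketch: split the sentence on whitespace and rejoin the words in reverse order.

```python
def reverse_words(sentence: str) -> str:
    """Return the sentence with its word order reversed."""
    return " ".join(reversed(sentence.split()))

print(reverse_words("the quick brown fox"))   # fox brown quick the
```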

Q150. What tools and frameworks you have used and details about that

Ans.

I have experience with tools like Apache Spark, Hadoop, and Kafka for data processing and streaming.

  • Apache Spark for large-scale data processing

  • Hadoop for distributed storage and processing

  • Kafka for real-time data streaming
