Data Engineer

1000+ Data Engineer Interview Questions and Answers

Updated 3 Mar 2025

Q151. Discuss your project and its architecture.

Ans.

Developed a data pipeline to process and analyze customer behavior data.

  • Used Apache Kafka for real-time data streaming

  • Implemented data processing using Apache Spark

  • Stored data in Hadoop Distributed File System (HDFS)

  • Used Tableau for data visualization

Q152. What are your experiences with cloud services, particularly with Amazon Web Services (AWS)?

Ans.

I have extensive experience working with AWS, including setting up and managing various cloud services.

  • Implemented data pipelines using AWS Glue for ETL processes

  • Utilized AWS S3 for storing and managing large datasets

  • Deployed and managed EC2 instances for running data processing tasks

  • Used AWS Lambda for serverless computing and automation

  • Worked with AWS Redshift for data warehousing

Q153. What is the difference between UNION and UNION ALL?

Ans.

UNION combines result sets and removes duplicates; UNION ALL combines all rows, including duplicates (a runnable sketch follows the list below).

  • Union merges two tables and removes duplicates

  • Union All merges two tables and includes duplicates

  • Union is slower than Union All as it removes duplicates

  • Syntax: SELECT column1, column2 FROM table1 UNION/UNION ALL SELECT column1, column2 FROM table2

  • Example: SELECT name FROM table1 UNION SELECT name FROM table2
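
A runnable sketch of the difference, using Python's built-in sqlite3 module (the table names and rows are illustrative):

    import sqlite3

    # In-memory database with two small sample tables
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE table1(name TEXT);
        CREATE TABLE table2(name TEXT);
        INSERT INTO table1 VALUES ('alice'), ('bob');
        INSERT INTO table2 VALUES ('bob'), ('carol');
    """)

    # UNION removes the duplicate 'bob'
    print(conn.execute("SELECT name FROM table1 UNION SELECT name FROM table2").fetchall())
    # e.g. [('alice',), ('bob',), ('carol',)] -- order not guaranteed without ORDER BY

    # UNION ALL keeps both copies of 'bob'
    print(conn.execute("SELECT name FROM table1 UNION ALL SELECT name FROM table2").fetchall())
    # e.g. [('alice',), ('bob',), ('bob',), ('carol',)]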

Q154. What are transformations, and how many types of transformations exist?

Ans.

Transformations are operations performed on data to convert it from one form to another. There are mainly two types of transformations: narrow and wide.

  • Transformations are operations performed on data to convert it from one form to another.

  • Narrow transformations are those where each input partition will contribute to only one output partition, e.g., map, filter.

  • Wide transformations are those where each input partition may contribute to multiple output partitions, e.g., groupBy, reduceByKey, join (see the sketch below).
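
A minimal PySpark sketch of the two kinds of transformation, assuming a local Spark session and illustrative data:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("transformations").getOrCreate()
    rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)], numSlices=2)

    # Narrow transformations: each input partition feeds exactly one output partition
    mapped = rdd.map(lambda kv: (kv[0], kv[1] * 10))
    filtered = mapped.filter(lambda kv: kv[1] > 10)

    # Wide transformation: reduceByKey/groupBy shuffles data across partitions
    totals = filtered.reduceByKey(lambda x, y: x + y)
    print(totals.collect())

    spark.stop()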

Q155. What are the different types of joins available in Databricks?

Ans.

Different types of joins available in Databricks include inner join, outer join, left join, right join, and cross join.

  • Inner join: Returns only the rows that have matching values in both tables.

  • Full outer join: Returns all rows from both tables, with NULLs where there is no match.

  • Left join: Returns all rows from the left table and the matched rows from the right table.

  • Right join: Returns all rows from the right table and the matched rows from the left table.

  • Cross join: Returns the Cartesian product of the two tables (every row of the first paired with every row of the second).

Q156. What is the SQL query to calculate the average sales over a period of 7 days?

Ans.

Calculate the average sales over a 7-day period with a SQL query (a runnable sketch follows the list below).

  • Use the AVG() function to calculate the average sales.

  • Filter the data based on the date range of the 7-day period using WHERE clause.

  • Group the data by date to calculate the average sales for each day.

  • Join the tables if necessary to get the sales data.
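
As a sketch, two common interpretations of the question, run here with Python's sqlite3 (the sales table and dates are assumed; the rolling variant needs SQLite 3.25+ for window functions):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE sales(sale_date TEXT, amount REAL);
        INSERT INTO sales VALUES
            ('2024-01-01', 100), ('2024-01-02', 150), ('2024-01-03', 90),
            ('2024-01-04', 200), ('2024-01-05', 120), ('2024-01-06', 80),
            ('2024-01-07', 160), ('2024-01-08', 110);
    """)

    # Average sales over one fixed 7-day period
    print(conn.execute("""
        SELECT AVG(amount) AS avg_sales
        FROM sales
        WHERE sale_date BETWEEN '2024-01-01' AND '2024-01-07'
    """).fetchone())

    # Rolling 7-day average per day, using a window frame
    print(conn.execute("""
        SELECT sale_date,
               AVG(amount) OVER (ORDER BY sale_date
                                 ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_avg
        FROM sales
    """).fetchall())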

Q157. What tools and frameworks have you used? Give details about them.

Ans.

I have experience with tools like Apache Spark, Hadoop, and Kafka for data processing and streaming.

  • Apache Spark for large-scale data processing

  • Hadoop for distributed storage and processing

  • Kafka for real-time data streaming

Q158. What are lists? What is the difference between a list, tuple, dict, and set? How do you extract elements from a dict?

Ans.

Lists are ordered collections of items in Python.

  • Lists are mutable, meaning their elements can be changed.

  • Elements in a list are accessed using indexing.

  • Lists can contain elements of different data types.

  • Lists are denoted by square brackets []; a sketch comparing lists, tuples, dicts, and sets, including dict element extraction, follows below.
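
A short comparison sketch, including how to extract elements from a dict:

    # list: ordered and mutable; tuple: ordered and immutable
    scores = [85, 92, 78]                 # elements can be appended or changed
    point = (10, 20)                      # cannot be changed after creation

    # dict: key-value mapping; set: unordered collection of unique items
    person = {"name": "Asha", "age": 30}
    tags = {"python", "sql", "python"}    # duplicates collapse to {'python', 'sql'}

    # Extracting elements from a dict
    name = person["name"]                 # raises KeyError if the key is missing
    city = person.get("city", "unknown")  # returns a default instead of raising
    for key, value in person.items():     # iterate over key-value pairs
        print(key, value)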

Q159. What are window functions in SQL? What are SQL joins? What is a CTE in SQL?

Ans.

Window functions are used to perform calculations across a set of rows in a table.

  • Window functions operate on a subset of rows called a window or frame.

  • They can be used to calculate running totals, rankings, and moving averages.

  • Examples of window functions include ROW_NUMBER, RANK, and LAG.

  • SQL joins are used to combine rows from two or more tables based on a related column between them.

  • Common types of joins include INNER JOIN, LEFT JOIN, and RIGHT JOIN.

  • CTE (Common Table Expression) is a named temporary result set defined with WITH and referenced within a single query (see the sketch below).
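
A compact sketch combining a CTE with a window function, run here with Python's sqlite3 (SQLite 3.25+; the employees table is illustrative):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE employees(id INTEGER, dept TEXT, salary INTEGER);
        INSERT INTO employees VALUES (1, 'eng', 90), (2, 'eng', 120), (3, 'hr', 70), (4, 'hr', 85);
    """)

    # The CTE defines a named intermediate result; the window function ranks rows per department
    query = """
    WITH ranked AS (
        SELECT id, dept, salary,
               RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS salary_rank
        FROM employees
    )
    SELECT * FROM ranked WHERE salary_rank = 1
    """
    print(conn.execute(query).fetchall())   # highest-paid employee per department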

Q160. What is PySpark, and can you explain its features and uses?

Ans.

PySpark is a Python API for Apache Spark, used for big data processing and analytics.

  • PySpark is a Python API for Apache Spark, a fast and general-purpose cluster computing system.

  • It allows for easy integration with Python libraries and provides high-level APIs in Python.

  • PySpark can be used for processing large datasets, machine learning, real-time data streaming, and more.

  • It supports various data sources such as HDFS, Apache Hive, JSON, Parquet, and more.

  • PySpark is widely used for large-scale ETL, analytics, and machine learning workloads.

Q161. When would you decide to use repartition versus coalesce?

Ans.

Repartition is used for increasing partitions for parallelism, while coalesce is used for decreasing partitions to reduce shuffling.

  • Repartition is used when there is a need for more partitions to increase parallelism.

  • Coalesce is used when there are too many partitions and need to reduce them to avoid shuffling.

  • Example: Repartition can be used before a join operation to evenly distribute data across partitions for better performance.

  • Example: Coalesce can be used after a filter that leaves many small partitions, to merge them before writing the output (a sketch follows below).
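
A brief PySpark sketch of the two calls (local session, illustrative data):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("partitions").getOrCreate()
    df = spark.range(1_000_000)

    # repartition(200): full shuffle that rebalances/increases partitions for more parallelism
    wide_df = df.repartition(200)
    print(wide_df.rdd.getNumPartitions())    # 200

    # coalesce(10): merges existing partitions without a full shuffle, e.g. after a selective filter
    narrow_df = wide_df.filter("id % 100 = 0").coalesce(10)
    print(narrow_df.rdd.getNumPartitions())  # 10

    spark.stop()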

Q162. What type of schema did you use for your project (Star schema, Snowflake schema)?

Ans.

I have experience working with both Star and Snowflake schemas in my projects.

  • Star schema is a denormalized schema where one central fact table is connected to multiple dimension tables.

  • Snowflake schema is a normalized schema where dimension tables are further normalized into sub-dimension tables.

  • Used Star schema for simpler, smaller datasets where performance is a priority.

  • Used Snowflake schema for complex, larger datasets where data integrity and scalability are important.

Q163. How do you feel about changing technologies and adapting to them?

Ans.

I am highly adaptable and embrace changing technologies with enthusiasm.

  • I believe that staying up-to-date with the latest technologies is crucial for success in the field of data engineering.

  • I actively seek out opportunities to learn and explore new technologies.

  • I enjoy the challenge of adapting to new tools and frameworks and find it intellectually stimulating.

  • I have successfully adapted to changing technologies in my previous roles, such as transitioning from traditional ETL tools to cloud-based data platforms.

Q164. Review your overall biodata, from school through to your current job.

Ans.

I have a strong educational background in computer science and relevant work experience in data engineering.

  • Studied computer science in college

  • Completed internships in data engineering

  • Worked as a data engineer at XYZ company

Q165. What are the different kinds of triggers available in Data Factory, and what is the use case for each?

Ans.

Different kinds of triggers in Data Factory and their use cases

  • Schedule Trigger: Runs pipelines on a specified schedule, like daily or hourly

  • Tumbling Window Trigger: Triggers pipelines based on a defined window of time

  • Event Trigger: Triggers pipelines based on events like file arrival or HTTP request

  • Storage Event Trigger (for example on a Data Lake Storage Gen2 account): Triggers pipelines when new data is added to the storage account

Q166. How does AWS Glue work, and how can ETL be performed with Glue?

Ans.

AWS Glue is a fully managed ETL service that makes it easy to prepare and load data for analytics.

  • AWS Glue works by automatically discovering, cataloging, and transforming your data

  • ETL can be performed with Glue by creating ETL jobs using the Glue console or API

  • Glue supports various data sources such as Amazon S3, RDS, Redshift, and more

  • You can schedule and monitor ETL jobs using Glue's built-in scheduler and monitoring tools

Q167. What is the difference between supervised and unsupervised learning?

Ans.

Supervised learning uses labeled data to train the model, while unsupervised learning uses unlabeled data.

  • Supervised learning requires a target variable to be predicted, while unsupervised learning does not.

  • In supervised learning, the model learns from labeled training data, whereas in unsupervised learning, the model finds patterns in unlabeled data.

  • Examples of supervised learning include regression and classification tasks, while clustering is a common unsupervised learning technique.

Q168. Write Python code to download a file from AWS S3 to local storage using boto3.

Ans.

Use boto3 to download a file from AWS S3 to the local filesystem (a runnable sketch follows the list below).

  • Install boto3 library using pip

  • Create an S3 client using boto3

  • Use the download_file method to download the file from S3 to local
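
A minimal sketch of those steps (the bucket, key, and local path are placeholders; credentials are assumed to come from the environment or an IAM role):

    import boto3

    # Create an S3 client; region and credentials are resolved from the environment/IAM role
    s3 = boto3.client("s3")

    bucket_name = "my-example-bucket"     # placeholder
    object_key = "data/input/file.csv"    # placeholder
    local_path = "/tmp/file.csv"          # placeholder

    # Download the object from S3 to the local filesystem
    s3.download_file(bucket_name, object_key, local_path)
    print(f"Downloaded s3://{bucket_name}/{object_key} to {local_path}")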

Q169. What components are used in graphs to remove duplicates?

Ans.

Components used in graphs to remove duplicates include HashSet and HashMap.

  • Use HashSet to store unique elements

  • Use HashMap to store key-value pairs with unique keys

  • Iterate through the graph and add elements to HashSet or HashMap to remove duplicates

Q170. What factors should be considered when designing a road curve?

Ans.

Factors to consider when designing a road curve

  • Radius of the curve

  • Speed limit of the road

  • Banking of the curve

  • Visibility around the curve

  • Traffic volume on the road

  • Road surface conditions

  • Presence of obstacles or hazards

  • Environmental factors such as weather conditions

Q171. What is the difference between a tuple and a list in Python?

Ans.

A tuple is immutable and uses parentheses, while a list is mutable and uses square brackets in Python.

  • Tuples are immutable, meaning their elements cannot be changed after creation

  • Lists are mutable, allowing for elements to be added, removed, or modified

  • Tuples are created using parentheses, while lists use square brackets

  • Example: point = (1, 2, 3) is a tuple, while values = [1, 2, 3] is a list (avoid naming variables tuple or list, which shadows the built-ins)

Q172. 1) Project Architecture 2) Complex job handles in project 3) Types of lookup 4) SCD -2 implementation in datastage 5) sql - analytical functions,scenario based question 6) Unix - SED/GREP command

Ans.

The interview questions cover project architecture, complex job handling, lookup types, SCD-2 implementation, SQL analytical functions, and Unix commands.

  • Project architecture involves designing the overall structure of a data project.

  • Complex job handling refers to managing intricate data processing tasks within a project.

  • Lookup types include exact match, range match, and fuzzy match.

  • SCD-2 implementation in DataStage involves capturing historical changes in data.

  • SQL analytical (window) functions such as ROW_NUMBER, RANK, and LAG perform calculations across related sets of rows.

Q173. Joins in SQL; modelling and visualization in Power BI.

Ans.

An overview of joins in SQL and of modelling and visualization in Power BI.

  • Joins in SQL are used to combine data from two or more tables based on a related column

  • There are different types of joins such as inner join, left join, right join, and full outer join

  • PowerBI is a data visualization tool that allows users to create interactive reports and dashboards

  • Data modeling in PowerBI involves creating relationships between tables and defining measures and calculated columns

  • Visualization in Power BI is built from charts, tables, and dashboards on top of the data model.

Q174. Projects you have worked on in the data engineering field.

Ans.

I have worked on projects involving building data pipelines, optimizing data storage, and implementing data processing algorithms.

  • Built data pipelines to extract, transform, and load data from various sources

  • Optimized data storage by implementing efficient database schemas and indexing strategies

  • Implemented data processing algorithms for real-time and batch processing

  • Worked on data quality monitoring and data governance initiatives

Q175. What do you know about Forms and Templates and their use in workflows and web reports?

Ans.

Forms and Templates are used in workflow and web reports to standardize data input and presentation.

  • Forms are used to collect data in a structured manner, often with predefined fields and formats

  • Templates are pre-designed layouts for presenting data in a consistent way

  • Forms and Templates help streamline processes, ensure data consistency, and improve reporting accuracy

  • In workflow management, Forms can be used to gather input from users at different stages of a process

  • Web reports present the collected data in a standardized, template-driven layout.

Q176. Why are there two keys available in Azure resources?

Ans.

Two keys are available in Azure resources mainly to allow key rotation without downtime.

  • Both keys grant the same level of access; either one can be used to authenticate against the resource.

  • Having two keys lets you regenerate (rotate) one key while applications keep using the other, so access is never interrupted.

  • Rotating keys regularly, or immediately if a key is compromised, adds an extra layer of security.

  • Examples of Azure resources that use two keys are Azure Storage and Azure Event Hubs.

Q177. Do you have experience with AWS Glue? How would you use Glue for data migration?

Ans.

Yes, I have experience in AWS Glue and can use it for data migration.

  • AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data for analytics.

  • To use Glue for data migration, I would start by creating a Glue job that defines the source and target data sources, as well as any transformations needed.

  • I would then configure the job to run on a schedule or trigger it manually to migrate the data from the source to the target.

  • Glue crawlers and the Data Catalog can be used to discover and track the source and target schemas during the migration.

Q178. What is SparkContext and SparkSession?

Ans.

SparkContext is the main entry point for Spark functionality, while SparkSession is the entry point for Spark SQL.

  • SparkContext is the entry point for low-level API functionality in Spark.

  • SparkSession is the entry point for Spark SQL functionality.

  • SparkContext is used to create RDDs (Resilient Distributed Datasets) in Spark.

  • SparkSession provides a unified entry point for reading data from various sources and performing SQL queries.

Q179. What is the difference between ALTER and UPDATE?

Ans.

Alter is used to modify the structure of a table, while update is used to modify the data in a table.

  • Alter is used to add, remove, or modify columns in a table.

  • Update is used to change the values of existing records in a table.

  • Alter can change the structure of a table, such as adding a new column or changing the data type of a column.

  • Update is used to modify the data in a table, such as changing the value of a specific column in a specific row.

Q180. Is a nested ForEach possible in ADF?

Ans.

Not directly; a ForEach activity in ADF cannot contain another ForEach, but the same effect is achieved with a child pipeline.

  • A ForEach activity cannot be nested directly inside another ForEach.

  • Instead, place an Execute Pipeline activity inside the outer ForEach and put the inner ForEach in the child pipeline.

  • Example: for each customer, a child pipeline iterates over each order and each item in the order.

  • Splitting the loops across pipelines also keeps each pipeline simpler and easier to monitor.

Q181. SQL window functions, and the difference between RANK and DENSE_RANK.

Ans.

SQL window functions like RANK and DENSE_RANK assign a rank to each row within a partition.

  • RANK assigns the same rank to tied rows and then skips the following rank(s), leaving gaps (e.g., 1, 2, 2, 4).

  • DENSE_RANK also gives tied rows the same rank but does not leave gaps (e.g., 1, 2, 2, 3).

  • Both functions are used with the OVER() clause in SQL to define the partition and ordering of rows (see the sketch below).
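
A small sqlite3 sketch showing the gap behaviour with tied marks (illustrative data, SQLite 3.25+):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE scores(student TEXT, marks INTEGER);
        INSERT INTO scores VALUES ('a', 95), ('b', 90), ('c', 90), ('d', 80);
    """)

    rows = conn.execute("""
        SELECT student, marks,
               RANK()       OVER (ORDER BY marks DESC) AS rnk,        -- 1, 2, 2, 4 (gap after the tie)
               DENSE_RANK() OVER (ORDER BY marks DESC) AS dense_rnk   -- 1, 2, 2, 3 (no gap)
        FROM scores
    """).fetchall()
    print(rows)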

Q182. What challenges did you face while migrating data from one system to another?

Ans.

Challenges faced during data migration include data loss, compatibility issues, downtime, and security concerns.

  • Data loss: Ensuring all data is successfully transferred without any loss or corruption.

  • Compatibility issues: Ensuring data formats, structures, and systems are compatible for seamless migration.

  • Downtime: Minimizing downtime during migration to avoid disruption to operations.

  • Security concerns: Ensuring data security and privacy are maintained throughout the migration.

Q183. What are the different modes of execution in Apache Spark?

Ans.

The different modes of execution in Apache Spark include local mode, standalone mode, YARN mode, and Mesos mode.

  • Local mode: Spark runs on a single machine with one executor.

  • Standalone mode: Spark runs on a cluster managed by a standalone cluster manager.

  • YARN mode: Spark runs on a Hadoop cluster using YARN as the resource manager.

  • Mesos mode: Spark runs on a Mesos cluster with Mesos as the resource manager.

Q184. What is SQL? Explain normalization.

Ans.

SQL is a programming language used to manage and manipulate relational databases. Normalizing is the process of organizing data in a database to minimize redundancy.

  • SQL stands for Structured Query Language

  • It is used to create, modify, and query relational databases

  • Normalization is the process of breaking down a database into smaller, more manageable tables to reduce redundancy and improve data integrity

  • There are different levels of normalization, such as first normal form (1NF), second normal form (2NF), and third normal form (3NF).

Q185. What data structure is used to implement a priority queue?

Ans.

Binary heap is used to implement a priority queue.

  • A binary heap is a complete binary tree where each node's value satisfies the heap property relative to its children (greater for a max-heap, smaller for a min-heap).

  • It can be implemented using an array, where the parent of the node at index i is at index (i-1)/2.

  • Insertion and deletion operations have O(log n) time complexity in a binary heap (a Python heapq sketch follows below).
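
A short sketch of a priority queue built on a binary heap with Python's heapq module:

    import heapq

    # heapq keeps a min-heap inside a plain list: the smallest priority is always at index 0
    tasks = []
    heapq.heappush(tasks, (2, "load to warehouse"))      # push is O(log n)
    heapq.heappush(tasks, (1, "extract from source"))    # lower number = higher priority
    heapq.heappush(tasks, (3, "send report"))

    while tasks:
        priority, task = heapq.heappop(tasks)            # pop is O(log n)
        print(priority, task)
    # 1 extract from source, then 2 load to warehouse, then 3 send report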

Q186. SQL: how do you identify long-running queries?

Ans.

Identifying long running queries in SQL

  • Monitor query execution times using tools like SQL Server Profiler or Performance Monitor

  • Check system views like sys.dm_exec_requests or sys.dm_exec_query_stats for query durations

  • Use query hints like OPTION (RECOMPILE) to force recompilation of queries for better performance

Q187. How do you treat outliers in a dataset?

Ans.

Outliers in a dataset can be treated by identifying them using statistical methods and then either removing them, transforming them, or keeping them based on the context.

  • Identify outliers using statistical methods like Z-score, IQR, or visualization techniques like box plots.

  • Decide whether to remove outliers, transform them using techniques like winsorization or log transformation, or keep them based on the context of the analysis.

  • Consider the impact of outliers on the analysis before deciding how to handle them.

Q188. How do you connect to ADLS Gen2 from Databricks?

Ans.

To connect to ADLS Gen2 from Databricks, use the ABFS (abfss://) driver with an access key, a service principal, or a mounted location.

  • Access ADLS Gen2 from Databricks through the abfss:// URI scheme backed by the ABFS driver

  • Provide the storage account name and an access key (or service principal credentials) for authentication

  • Reference paths as abfss://<container>@<storage-account>.dfs.core.windows.net/<path>

  • Example (placeholders in angle brackets): spark.conf.set('fs.azure.account.key.<storage-account>.dfs.core.windows.net', '<access-key>')

Q189. What is the purpose of using Common Table Expressions (CTE) in SQL?

Ans.

CTEs are used in SQL to create temporary result sets that can be referenced within a query.

  • CTEs improve readability and maintainability of complex queries

  • They can be recursively referenced to solve hierarchical problems

  • CTEs can be used to simplify queries by breaking them into smaller, more manageable parts

Q190. Write Python code to extract the pincode from an address field in a DataFrame using PySpark.

Ans.

Extract Pincode from Address Field in Dataframe using Pyspark

  • Use the regexp_extract() function from pyspark.sql.functions to extract the pincode from the address field

  • Create a new column in the DataFrame to store the extracted pincode

  • Specify the regular expression pattern for pincode extraction as a raw string

  • Example: df.withColumn('pincode', regexp_extract(df['address'], r'\b\d{6}\b', 0)) (a fuller sketch follows below)
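
A fuller sketch of the same idea (local Spark session; the addresses and the 6-digit pincode pattern are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import regexp_extract

    spark = SparkSession.builder.master("local[*]").appName("pincode").getOrCreate()

    df = spark.createDataFrame(
        [("12 MG Road, Bengaluru 560001",), ("Flat 4, Andheri West, Mumbai 400053",)],
        ["address"],
    )

    # Extract the first 6-digit group as the pincode (empty string if there is no match)
    df = df.withColumn("pincode", regexp_extract("address", r"\b(\d{6})\b", 1))
    df.show(truncate=False)

    spark.stop()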

Q191. What activity is used for creating an email notification?

Ans.

Email notifications are usually created by sending an email over SMTP from the workflow.

  • Use SMTP (Simple Mail Transfer Protocol) to send emails

  • Set up an email server or use a third-party email service provider

  • Include the recipient's email address, subject, and message content

  • Can be automated using Python's smtplib library or email platforms like Mailchimp (a smtplib sketch follows below)
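
A minimal smtplib sketch for sending such a notification (the SMTP host, port, credentials, and addresses are placeholders):

    import smtplib
    from email.message import EmailMessage

    msg = EmailMessage()
    msg["Subject"] = "Pipeline run completed"
    msg["From"] = "alerts@example.com"           # placeholder sender
    msg["To"] = "data-team@example.com"          # placeholder recipient
    msg.set_content("The daily load finished successfully.")

    # Connect to the SMTP server over STARTTLS and send the message
    with smtplib.SMTP("smtp.example.com", 587) as server:   # placeholder host/port
        server.starttls()
        server.login("alerts@example.com", "app-password")  # placeholder credentials
        server.send_message(msg)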

Q192. 1. How does Spark process code? 2. What is the difference between repartition and coalesce? 3. How do you connect to SQL Server from Databricks?

Ans.

Spark processes code using DAG (Directed Acyclic Graph) which optimizes and executes tasks in parallel.

  • Spark processes code by creating a DAG of tasks based on transformations and actions in the code.

  • It optimizes the DAG by combining tasks and executing them in parallel to maximize efficiency.

  • Repartition increases or decreases the number of partitions in a DataFrame, while coalesce only decreases partitions without shuffling data.

  • To connect to SQL Server from Databricks, use the JDBC data source with the server URL, database, table, and credentials (a sketch follows below).
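
A sketch of that JDBC read (in a Databricks notebook the spark session already exists; the server, database, table, and credentials are placeholders, and the SQL Server JDBC driver is assumed to be available on the cluster):

    # Placeholder connection details
    jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb"

    df = (
        spark.read.format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", "dbo.sales")                    # placeholder table
        .option("user", "sql_user")                        # placeholder credentials
        .option("password", "sql_password")
        .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
        .load()
    )
    df.show(5)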

Q193. What are Airflow operators, and what is the use of the Airflow PythonOperator?

Ans.

Airflow operators are used to define tasks in a workflow. The Airflow Python operator is used to execute Python functions as tasks.

  • Airflow operators are used to define individual tasks in a workflow

  • The Airflow Python operator is specifically used to execute Python functions as tasks

  • It allows for flexibility in defining custom tasks using Python code

  • Example: PythonOperator(task_id='my_task', python_callable=my_python_function) (a fuller DAG sketch follows below)
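
A small Airflow 2.x DAG sketch around that operator (the DAG id, schedule, and callable are illustrative):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def my_python_function():
        print("running the Python task")

    with DAG(
        dag_id="example_python_operator",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        my_task = PythonOperator(task_id="my_task", python_callable=my_python_function)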

Q194. How do you select the unique customers from the last 3 months of sales?

Ans.

Use SQL query to select unique customers in last 3 months sales

  • Filter sales data for the last 3 months

  • Use DISTINCT keyword to select unique customers

  • Join with customer table if necessary

Q195. What is data proc, and why did you choose it in your project?

Ans.

Data proc is short for data processing, which involves transforming raw data into a more usable format for analysis.

  • Data proc involves cleaning, transforming, and aggregating raw data

  • It helps in preparing data for analysis and visualization

  • Examples include cleaning and formatting data from multiple sources before loading into a database

Q196. How do you find the delta between two tables in SQL?

Ans.

To find the delta between two tables in SQL, use the EXCEPT operator (MINUS in Oracle).

  • Use the EXCEPT operator in SQL to return rows from the first table that do not exist in the second table.

  • MINUS is Oracle's equivalent of EXCEPT and returns distinct rows from the first table that do not exist in the second table (a sketch follows below).
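
A runnable sketch with sqlite3, which supports EXCEPT (the snapshot tables are illustrative):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE old_snapshot(id INTEGER, amount INTEGER);
        CREATE TABLE new_snapshot(id INTEGER, amount INTEGER);
        INSERT INTO old_snapshot VALUES (1, 100), (2, 200), (3, 300);
        INSERT INTO new_snapshot VALUES (1, 100), (2, 250), (4, 400);
    """)

    # Rows in new_snapshot that are not in old_snapshot (inserted or changed)
    print(conn.execute("SELECT * FROM new_snapshot EXCEPT SELECT * FROM old_snapshot").fetchall())

    # Rows in old_snapshot that are not in new_snapshot (deleted or changed)
    print(conn.execute("SELECT * FROM old_snapshot EXCEPT SELECT * FROM new_snapshot").fetchall())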

Q197. How can we join a table without any identity columns?

Ans.

You can join tables without identity columns using other unique columns or composite keys.

  • Use other unique columns or composite keys to join the tables

  • Consider using a combination of columns to create a unique identifier for joining

  • If no unique columns are available, consider using a combination of non-unique columns with additional logic to ensure accurate joins

Q198. SQL: calculate the difference in marks for each student ID across different years.

Ans.

Use SQL to calculate the difference in marks for each student ID across different years (a runnable sketch follows the list below).

  • Use a self join on the table to compare marks for the same student ID across different years.

  • Calculate the difference in marks by subtracting the marks from different years.

  • Group the results by student ID to get the difference in marks for each student.
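
A runnable sketch of the self-join described above, using sqlite3 on an assumed marks table (student_id, year, marks):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE marks(student_id INTEGER, year INTEGER, marks INTEGER);
        INSERT INTO marks VALUES (1, 2022, 70), (1, 2023, 85), (2, 2022, 60), (2, 2023, 55);
    """)

    # Join each year's row to the previous year's row for the same student
    rows = conn.execute("""
        SELECT cur.student_id, cur.year,
               cur.marks - prev.marks AS marks_diff
        FROM marks cur
        JOIN marks prev
          ON cur.student_id = prev.student_id
         AND cur.year = prev.year + 1
        ORDER BY cur.student_id, cur.year
    """).fetchall()
    print(rows)   # [(1, 2023, 15), (2, 2023, -5)]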

Q199. How long have you worked with SQL and databases?

Ans.

I have been working with SQL and databases for 5 years.

  • I have 5 years of experience working with SQL and databases.

  • I have worked on various database management systems such as MySQL, PostgreSQL, and SQL Server.

  • I have experience writing complex SQL queries, optimizing database performance, and designing database schemas.

  • I have also worked on data modeling, ETL processes, and data warehousing.

  • I am proficient in using tools like SQL Server Management Studio and MySQL Workbench.

Q200. Are you okay with working remotely, and sometimes in a hybrid setup as well?

Ans.

Yes, I am comfortable working remotely and in a hybrid setup.

  • I have experience working remotely and understand the importance of communication and collaboration in a virtual environment.

  • I am adaptable and can easily transition between remote and hybrid work setups.

  • I have the necessary technology and tools to effectively work remotely and stay connected with the team.

  • I am open to occasional in-person meetings or team gatherings in a hybrid setup.
