Data Engineer

1000+ Data Engineer Interview Questions and Answers

Updated 11 Jul 2025

Asked in IBM

1d ago

Q. How do you create a Kafka topic with a replication factor of 2?

Ans.

To create a Kafka topic with replication factor 2, use the command line tool or Kafka API.

  • Use the command line tool 'kafka-topics.sh' with the '--replication-factor' flag set to 2.

  • Alternatively, use the Kafka API to create a topic with a replication factor of 2.

  • Ensure that the number of brokers in the Kafka cluster is greater than or equal to the replication factor.

  • Consider setting the 'min.insync.replicas' configuration property to 2 to ensure that at least two replicas are …
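
For example, the command-line approach described above, with an equivalent sketch using the kafka-python admin client (broker address, topic name, and partition count are placeholders):

```python
# CLI equivalent:
#   kafka-topics.sh --create --topic my_topic --partitions 3 \
#       --replication-factor 2 --bootstrap-server localhost:9092
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([
    NewTopic(name="my_topic", num_partitions=3, replication_factor=2,
             topic_configs={"min.insync.replicas": "2"})
])
```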

Asked in Accenture

5d ago

Q. What happens when we enforce the schema and when we manually define the schema in the code?

Ans.

Enforcing the schema ensures data consistency and validation, while manually defining the schema in code allows for more flexibility and customization.

  • Enforcing the schema ensures that all data conforms to a predefined structure and format, preventing errors and inconsistencies.

  • Manually defining the schema in code allows for more flexibility in handling different data types and structures.

  • Enforcing the schema can be done using tools like Apache Avro or Apache Parquet, while m…
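
A minimal PySpark sketch contrasting the two approaches (column names, file paths, and an active SparkSession named `spark` are assumed):

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# manually defined schema: explicit control, no inference pass over the data
schema = StructType([
    StructField("id", IntegerType(), nullable=False),
    StructField("name", StringType(), nullable=True),
])

# enforce the schema at read time; non-conforming records are handled per the chosen mode
df = (spark.read
      .schema(schema)
      .option("mode", "PERMISSIVE")   # or FAILFAST to reject malformed records outright
      .json("/data/users.json"))

# letting Spark infer the schema instead (slower, but more flexible)
df_inferred = spark.read.option("inferSchema", "true").csv("/data/users.csv", header=True)
```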

Q. There are 4 balls of each color (Red, Green, Blue) in a box. If you draw 3 balls randomly, what is the probability of all 3 balls being the same color?

Ans.

The probability of drawing 3 balls of the same color from a box containing 4 balls of each color (Red, Green, Blue) works out to 3/55 (about 5.45%).

  • Calculate the total number of ways to draw 3 balls out of 12 balls

  • Calculate the number of ways to draw 3 balls of the same color

  • Divide the number of favorable outcomes by the total number of outcomes to get the probability
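
A quick worked check of those steps using Python's math.comb:

```python
from math import comb

total = comb(12, 3)            # 220 ways to draw any 3 of the 12 balls
same_color = 3 * comb(4, 3)    # 3 colors x 4 ways to pick 3 of that color's 4 balls = 12
probability = same_color / total
print(probability)             # 12/220 = 3/55 ≈ 0.0545
```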

Asked in LTIMindtree

4d ago

Q. How do you do performance optimization in Spark? How did you do it in your project?

Ans.

Performance optimization in Spark involves tuning configurations, optimizing code, and utilizing caching.

  • Tune Spark configurations such as executor memory, number of executors, and shuffle partitions.

  • Optimize code by reducing unnecessary shuffles, using efficient transformations, and avoiding unnecessary data movements.

  • Utilize caching to store intermediate results in memory and avoid recomputation.

  • Example: In my project, I optimized Spark performance by increasing executor me…
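
A small sketch of the kinds of settings and caching described above (paths and values are illustrative, and an active SparkSession named `spark` is assumed):

```python
# tune shuffle parallelism; exact values depend on data volume and cluster size
spark.conf.set("spark.sql.shuffle.partitions", "400")
spark.conf.set("spark.sql.adaptive.enabled", "true")   # lets Spark coalesce partitions and handle skew

# cache a DataFrame that several downstream actions reuse, to avoid recomputation
base = spark.read.parquet("/data/events")
base.cache()
daily = base.groupBy("event_date").count()
by_user = base.groupBy("user_id").count()
```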

4d ago

Q. What happens if a job fails in the pipeline after the data processing cycle is complete?

Ans.

If a job fails in the pipeline and data processing cycle is over, it can lead to incomplete or inaccurate data.

  • Incomplete data may affect downstream processes and analysis

  • Data quality may be compromised if errors are not addressed

  • Monitoring and alerting systems should be in place to detect and handle failures

  • Re-running the failed job or implementing error handling mechanisms can help prevent issues in the future

Asked in HSBC Group

2w ago

Q. 1. What is a UDF in Spark? 2. Write PySpark code to check the validity of the mobile_number column.

Ans.

UDF stands for User-Defined Function in Spark. It allows users to define their own functions to process data.

  • UDFs can be written in different programming languages like Python, Scala, and Java.

  • UDFs can be used to perform complex operations on data that are not available in built-in functions.

  • PySpark code to check the validity of mobile_number column can be written using regular expressions and the `regexp_extract` function.

  • Example: `df.select('mobile_number', regexp_extract(…
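
A hedged sketch of both parts of the question — a UDF definition and a validity check on mobile_number (the "10 digits starting with 6-9" pattern and the DataFrame `df` are assumptions):

```python
import re
from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

# UDF: wrap ordinary Python logic so it can be applied to a DataFrame column
is_valid_mobile = F.udf(lambda s: bool(re.fullmatch(r"[6-9]\d{9}", s or "")), BooleanType())
df = df.withColumn("is_valid", is_valid_mobile(F.col("mobile_number")))

# the same check without a UDF, using the built-in rlike (usually faster than a Python UDF)
df = df.withColumn("is_valid", F.col("mobile_number").rlike(r"^[6-9][0-9]{9}$"))
```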

Asked in Accenture

5d ago

Q. How many stages will be created from the code I have written?

Ans.

The number of stages created from the code provided depends on the specific code and its functionality.

  • The number of stages varies with the complexity of the code and the transformations it performs.

  • In Spark, a new stage is created at every shuffle boundary: wide transformations such as groupBy, join, distinct, or repartition start a new stage, while narrow transformations (filter, select, map) are pipelined within the same stage.

  • Analyze the code for wide transformations, or check the DAG in the Spark UI, to determine the total number of stages.
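
Assuming the question refers to Spark, a tiny sketch (hypothetical DataFrame `df`) of how wide transformations introduce stage boundaries:

```python
df2 = df.filter("amount > 0").select("user_id", "amount")   # narrow: stays in the same stage
agg = df2.groupBy("user_id").sum("amount")                  # wide: shuffle -> new stage
agg.write.mode("overwrite").parquet("/tmp/agg_out")         # the action triggers a job of ~2 stages
```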

Asked in HSBC Group

1w ago

Q. Merge two unsorted lists such that the output list is sorted. You are free to use inbuilt sorting functions to sort the input lists

Ans.

Merge two unsorted lists into a sorted list using inbuilt sorting functions.

  • Use inbuilt sorting functions to sort the input lists

  • Merge the sorted lists using a merge algorithm

  • Return the merged and sorted list
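
A minimal Python sketch of both options (the sample lists are illustrative):

```python
import heapq

a, b = [5, 1, 9], [4, 2]

# simplest: concatenate and sort (built-in sorting is allowed)
merged = sorted(a + b)

# or sort each list first, then merge them in O(n + m) time
merged = list(heapq.merge(sorted(a), sorted(b)))

print(merged)  # [1, 2, 4, 5, 9]
```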

2w ago

Q. What are the key features and functionalities of Snowflake?

Ans.

Snowflake is a cloud-based data warehousing platform known for its scalability, performance, and ease of use.

  • Snowflake uses a unique architecture called multi-cluster, which separates storage and compute resources for better scalability and performance.

  • It supports both structured and semi-structured data, allowing users to work with various data types.

  • Snowflake offers features like automatic scaling, data sharing, and built-in support for SQL queries.

  • It provides a web interfa…

Asked in Accenture

5d ago

Q. What are internal and external tables in Hive?

Ans.

Internal tables store data within Hive's warehouse directory while external tables store data outside of it.

  • Internal tables are managed by Hive and are deleted when the table is dropped

  • External tables are not managed by Hive and data is not deleted when the table is dropped

  • Internal tables are faster for querying as data is stored within Hive's warehouse directory

  • External tables are useful for sharing data between different systems

  • Example: CREATE TABLE my_table (col1 INT, col2…
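
A hedged completion of the DDL above, issued through spark.sql on a Hive-enabled session (the S3 location is a placeholder):

```python
# managed (internal) table: Hive owns both the metadata and the files in its warehouse directory
spark.sql("CREATE TABLE IF NOT EXISTS my_table (col1 INT, col2 STRING)")

# external table: Hive tracks only the metadata; the files live at the given location
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS my_ext_table (col1 INT, col2 STRING)
    LOCATION 's3://my-bucket/my_ext_table/'
""")

# dropping them behaves differently
spark.sql("DROP TABLE my_table")      # data and metadata are removed
spark.sql("DROP TABLE my_ext_table")  # only metadata is removed; files at LOCATION remain
```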

Asked in Rakuten

2w ago

Q. Describe a case study on how to determine the optimal location to build a store using a data science approach.

Ans.

Utilize data science to analyze demographics, competition, and location factors for optimal store placement.

  • Analyze demographic data: Use census data to identify population density and income levels in potential areas.

  • Evaluate competition: Map existing stores and assess their performance to find underserved locations.

  • Consider foot traffic: Use mobile data or sensors to measure foot traffic in different areas at various times.

  • Assess accessibility: Analyze transportation networ…

Asked in Procore

2w ago

Q. What is a data lake? What is the difference between a data lake and a data warehouse?

Ans.

Data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed.

  • Data lake stores raw, unstructured data from various sources.

  • Data lake allows for storing large amounts of data without the need for a predefined schema.

  • Data lake is cost-effective for storing data that may not have a clear use case at the time of storage.

  • Data warehouse stores structured data for querying and analysis.

  • Data warehouse requires a predefined schema for d…

Asked in Capgemini

4d ago

Q. How would you join two large tables in PySpark?

Ans.

Use a broadcast join or a shuffle (sort-merge) join in PySpark to join two large tables efficiently.

  • Use a broadcast join when one of the tables is small enough to fit in executor memory; otherwise use a shuffle join.

  • Broadcast join - the smaller table is broadcast to all worker nodes, so the large table does not need to be shuffled.

  • Shuffle (sort-merge) join - both tables are partitioned on the join key and joined partition by partition.

  • Example: df1.join(broadcast(df2), 'join_key')

  • Example: df1.join(df2, 'join_key').repartition('join_key')
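
A slightly fuller sketch of both approaches, including the import the examples above rely on (df1, df2, 'join_key', and an active SparkSession `spark` are hypothetical):

```python
from pyspark.sql.functions import broadcast

# broadcast join: ship the smaller DataFrame (df2) to every executor, avoiding a shuffle of df1
joined = df1.join(broadcast(df2), "join_key")

# shuffle (sort-merge) join: both sides are repartitioned on the key
joined = df1.repartition("join_key").join(df2.repartition("join_key"), "join_key")

# for very large joins, Adaptive Query Execution can also help with skew and partition sizing
spark.conf.set("spark.sql.adaptive.enabled", "true")
```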

1w ago

Q. Write a SQL query to find the highest salary for each employee in each department, given the employees and department tables.

Ans.

Find the highest salary in each department using the employees and departments tables, then return the matching employee details.

  • Use the 'employees' table to get employee details including salary.

  • Join the 'employees' table with the 'departments' table on department ID.

  • Use the SQL 'GROUP BY' clause to group results by department.

  • Utilize the 'MAX()' function to find the highest salary within each department.

  • Example SQL query: SELECT department_id, MAX(salary) FROM employees GROUP BY department_id;
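
To also return the matching employee names, a correlated subquery works; column names such as emp_name and department_name are assumptions about the schema (shown here via spark.sql, but the SQL is standard):

```python
top_paid = spark.sql("""
    SELECT d.department_name, e.emp_name, e.salary
    FROM employees e
    JOIN departments d ON e.department_id = d.department_id
    WHERE e.salary = (SELECT MAX(e2.salary)
                      FROM employees e2
                      WHERE e2.department_id = e.department_id)
""")
```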

Asked in PwC

2w ago

Q. What is a data flow? What is the difference between an ADF pipeline and a data flow?

Ans.

Data flow is a visual representation of data movement and transformation. ADF pipeline is a set of activities to move and transform data.

  • Data flow is a drag-and-drop interface to design data transformation logic

  • ADF pipeline is a set of activities to orchestrate data movement and transformation

  • A data flow focuses on transformation logic, while a pipeline handles orchestration, scheduling, and control flow.

  • A data flow runs as an activity inside a pipeline (or interactively in debug mode) on Spark compute managed by ADF.

Asked in Infosys

2w ago

Q. What is the difference between DBMS and RDBMS?

Ans.

DBMS is a software system to manage databases while RDBMS is a type of DBMS that stores data in a structured manner.

  • DBMS stands for Database Management System while RDBMS stands for Relational Database Management System.

  • DBMS can manage any type of database while RDBMS manages only relational databases.

  • DBMS does not enforce any specific data model while RDBMS enforces the relational data model.

  • Examples of DBMS include MongoDB and Cassandra while examples of RDBMS include MySQL…

Asked in Cognizant

1w ago

Q. What are all the issues you faced in your project? What is a global parameter? Why do we need parameters in ADF? What are the APIs in Spark?

Ans.

Common project issues, global parameters, parameters in ADF, and the main Spark APIs.

  • Issues faced in project: data quality, scalability, performance, integration

  • Global parameter: a parameter that can be accessed across multiple components in a system

  • Parameters in ADF: used to pass values between activities in a pipeline

  • APIs in Spark: Spark SQL, Spark Streaming, MLlib, GraphX

3d ago

Q. Introduction and project flow. Why did you use HBase in your project? How did you query data in HBase? What was the purpose of Hive? What are external partitioned tables? What optimizations were done in your projects?

Ans.

Discussion on project flow, HBase, Hive, external partitioned tables, and optimization in a Data Engineer interview.

  • Explained project flow and the reason for using HBase in the project

  • Discussed querying data in HBase and the purpose of Hive

  • Described external partitioned tables and optimization techniques used in the project

Asked in Capgemini

2w ago

Q. What Spark configuration would you use to process 2 GB of data?

Ans.

Set the Spark configuration with appropriate memory and cores for efficient processing of 2 GB of data.

  • Increase executor memory and cores to handle larger data size

  • Adjust spark memory overhead to prevent out of memory errors

  • Optimize shuffle partitions for better performance
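
A sketch of a session configured along those lines (the values are illustrative starting points, not prescriptive):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("small-batch-job")
         .config("spark.executor.memory", "2g")
         .config("spark.executor.cores", "2")
         .config("spark.executor.instances", "2")
         .config("spark.sql.shuffle.partitions", "16")   # 2 GB rarely needs the default 200
         .getOrCreate())
```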

2d ago

Q. Write a Python program to convert a number to words. For example: input 123, output - One hundred twenty three.

Ans.

Python program to convert a number to words.

  • Use a dictionary to map numbers to words.

  • Divide the number into groups of three digits and convert each group to words.

  • Handle special cases like zero, negative numbers, and numbers greater than or equal to one billion.
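
A minimal sketch along those lines, handling 0 up to 999,999,999 (negative numbers and billions are left out for brevity):

```python
ONES = ["", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
        "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
        "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]

def three_digit_words(n):
    """Convert 0-999 to words."""
    words = []
    if n >= 100:
        words.append(ONES[n // 100] + " hundred")
        n %= 100
    if n >= 20:
        words.append(TENS[n // 10])
        n %= 10
    if n > 0:
        words.append(ONES[n])
    return " ".join(words)

def number_to_words(n):
    if n == 0:
        return "zero"
    scales = ["", " thousand", " million"]
    parts, i = [], 0
    while n > 0:
        n, chunk = divmod(n, 1000)
        if chunk:
            parts.append(three_digit_words(chunk) + scales[i])
        i += 1
    return " ".join(reversed(parts))

print(number_to_words(123))  # one hundred twenty three
```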

Asked in Procore

2w ago

Q. Why do we need a data warehouse? Why can't we store data in a normal transactional database?

Ans.

Data warehouses are designed for analytical queries and reporting, while transactional databases are optimized for transactional processing.

  • Data warehouses are optimized for read-heavy workloads, allowing for complex queries and reporting.

  • Transactional databases are optimized for write-heavy workloads, ensuring data integrity and consistency.

  • Data warehouses often store historical data for analysis, while transactional databases focus on current data for operational purposes.

  • D…

1w ago

Q. Can you provide an example of how to handle different categorical values based on their frequency?

Ans.

Treating categorical values based on frequency involves grouping rare values together.

  • Identify rare values based on their frequency distribution

  • Group rare values together to reduce complexity

  • Consider creating a separate category for rare values
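
A small pandas sketch of the idea (the sample values and the frequency threshold are illustrative):

```python
import pandas as pd

colors = pd.Series(["red", "red", "blue", "green", "red", "violet", "blue", "indigo"])

freq = colors.value_counts(normalize=True)            # relative frequency of each category
rare = freq[freq < 0.15].index                        # categories below the chosen threshold
cleaned = colors.where(~colors.isin(rare), "other")   # lump all rare values into one bucket

print(cleaned.value_counts())
```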

Asked in Accenture

1w ago

Q. What are the methods for migrating a Hive metastore to Unity Catalog in Databricks?

Ans.

Use Databricks-provided tools like databricks-connect and databricks-cli to migrate Hive metadata to Unity Catalog.

  • Use databricks-connect to connect to the Databricks workspace from your local development environment.

  • Use databricks-cli to export the Hive metadata from the existing Hive metastore.

  • Create a new Unity Catalog in Databricks and import the exported metadata using databricks-cli.

  • Validate the migration by checking the tables and databases in Unity Catalog.

Asked in Kellogg

2w ago

Q. Explanation of the current project architecture, the cloud services used in the project and the purpose of using them. Architecture of Spark and Hive.

Ans.

Our project architecture uses Spark and Hive for data processing and storage respectively. We utilize AWS services such as S3, EMR, and Glue for scalability and cost-effectiveness.

  • Spark is used for distributed data processing and analysis

  • Hive is used for data warehousing and querying

  • AWS S3 is used for storing large amounts of data

  • AWS EMR is used for running Spark and Hive clusters

  • AWS Glue is used for ETL (Extract, Transform, Load) jobs

  • The purpose of using these services is to…

Asked in KPMG India

1w ago

Q. Given a dictionary, how do you find the greatest number for the same key in Python?

Ans.

Find the greatest number for the same key in a Python dictionary.

  • Use max() function with key parameter to find the maximum value for each key in the dictionary.

  • Iterate through the dictionary and apply max() function on each key.

  • If the dictionary is nested, use recursion to iterate through all the keys.
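
A minimal sketch, assuming each key maps to a list of numbers (the data is hypothetical):

```python
scores = {"a": [3, 7, 2], "b": [10, 4], "c": [5]}

greatest = {k: max(v) for k, v in scores.items()}
print(greatest)  # {'a': 7, 'b': 10, 'c': 5}

# if instead you want the key holding the single greatest value:
print(max(scores, key=lambda k: max(scores[k])))  # 'b'
```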

Asked in KPMG India

1w ago

Q. 1. What are columnar storage, Parquet, and Delta? Why are they used?

Ans.

Columnar storage is a data storage format that stores data in columns rather than rows, improving query performance.

  • Columnar storage stores data in a column-wise manner instead of row-wise.

  • It improves query performance by reducing the amount of data that needs to be read from disk.

  • Parquet is a columnar storage file format that is optimized for big data workloads.

  • It is used in Apache Spark and other big data processing frameworks.

  • Delta is an open-source storage layer that prov…
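
A short PySpark sketch of writing and reading these formats (paths are placeholders, and the Delta write assumes the Delta Lake library is available on the cluster):

```python
# Parquet: columnar files with good compression and column pruning
df.write.mode("overwrite").parquet("/data/events_parquet")

# Delta: Parquet files plus a transaction log enabling ACID updates and time travel
df.write.format("delta").mode("overwrite").save("/data/events_delta")

# reading only two columns scans just those columns on disk
subset = spark.read.parquet("/data/events_parquet").select("user_id", "amount")
```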

Asked in IBM

1w ago

Q. How do you read JSON data using Spark?

Ans.

To read JSON data using Spark, use the SparkSession.read.json() method.

  • Create a SparkSession object

  • Use the read.json() method to read the JSON data

  • Specify the path to the JSON file or directory containing JSON files

  • The resulting DataFrame can be manipulated using Spark's DataFrame API
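
A minimal example of that flow (the file paths are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-json").getOrCreate()

df = spark.read.json("/data/events.json")   # accepts a single file or a directory of JSON files
df.printSchema()
df.show(5)

# for pretty-printed JSON spanning multiple lines per record
df_multi = spark.read.option("multiLine", True).json("/data/events_pretty.json")
```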

Asked in Capgemini

1w ago

Q. Write a SQL query to get the names of students who scored greater than 45 in each subject from the Student table.

Ans.

SQL query to retrieve student names with marks > 45 in each subject

  • Use GROUP BY and HAVING clauses to filter students with marks > 45 in each subject

  • Join Student table with Marks table on student_id to get marks for each student

  • Select student names from Student table based on the conditions
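
A hedged version of that query, assuming a Marks table with student_id, subject, and marks columns as the answer suggests (run here via spark.sql, but the SQL is standard):

```python
result = spark.sql("""
    SELECT s.name
    FROM Student s
    JOIN Marks m ON s.student_id = m.student_id
    GROUP BY s.student_id, s.name
    HAVING MIN(m.marks) > 45
""")
result.show()
```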

2w ago

Q. How would you process 1 million records using ADF efficiently, given that your compute can only process 10,000 records at a time?

Ans.

Efficiently process 1 million records in ADF by batching and parallelism.

  • Use Data Flow in ADF to create a pipeline that processes data in batches of 10,000 records.

  • Implement a ForEach activity to iterate over the batches, allowing parallel execution.

  • Utilize the 'Batch Size' setting in the Copy Data activity to control the number of records processed at once.

  • Consider partitioning the data source if applicable, to improve performance and reduce processing time.

  • Monitor and optim…

Q. Design a generic tool or package using PySpark that allows creating connections to multiple databases like MySQL, S3, or APIs. Fetch the results, perform transformations like handling null values, and then stor…

Ans.

Design a generic tool in pyspark to connect to multiple databases, fetch results, handle null values, and store output in another database

  • Use pyspark to create a tool that can connect to databases like mysql, s3, or api

  • Implement functions to fetch data from the databases and perform transformations like handling null values

  • Utilize pyspark to store the transformed data in another database

  • Consider using pyspark SQL functions for data transformations
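
A compact sketch of such a tool (source types, connection options, credentials, and table names are all illustrative; an API source would additionally need an HTTP client and spark.createDataFrame, which is omitted here):

```python
from pyspark.sql import SparkSession, DataFrame

def read_source(spark: SparkSession, source_type: str, options: dict) -> DataFrame:
    """Read from a configured source; the supported types here are illustrative."""
    if source_type == "jdbc":   # e.g. MySQL through a JDBC URL
        return spark.read.format("jdbc").options(**options).load()
    if source_type == "s3":     # files on S3 in parquet/csv/json
        return spark.read.format(options.get("format", "parquet")).load(options["path"])
    raise ValueError(f"Unsupported source type: {source_type}")

def clean(df: DataFrame, fill_values: dict) -> DataFrame:
    """Basic transformation step: fill nulls per column and drop fully empty rows."""
    return df.na.fill(fill_values).na.drop(how="all")

def write_target(df: DataFrame, options: dict) -> None:
    """Write the transformed result to a JDBC target (append mode as an example)."""
    df.write.format("jdbc").options(**options).mode("append").save()

if __name__ == "__main__":
    spark = SparkSession.builder.appName("generic-etl").getOrCreate()
    src = read_source(spark, "jdbc", {"url": "jdbc:mysql://host:3306/db",
                                      "dbtable": "orders", "user": "u", "password": "p"})
    out = clean(src, {"amount": 0, "status": "unknown"})
    write_target(out, {"url": "jdbc:mysql://host:3306/warehouse",
                       "dbtable": "orders_clean", "user": "u", "password": "p"})
```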
