Data Engineer
1000+ Data Engineer Interview Questions and Answers

Asked in IBM

Q. How do you create a Kafka topic with a replication factor of 2?
To create a Kafka topic with replication factor 2, use the command line tool or Kafka API.
Use the command line tool 'kafka-topics.sh' with the '--replication-factor' flag set to 2.
Alternatively, use the Kafka API to create a topic with a replication factor of 2.
Ensure that the number of brokers in the Kafka cluster is greater than or equal to the replication factor.
Consider setting the 'min.insync.replicas' configuration property to 2 so that, with acks=all, at least two replicas must acknowledge a write before it succeeds.
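A minimal sketch of both routes, assuming a broker reachable at broker1:9092 and the kafka-python package for the API route; the topic name and partition count are illustrative.

```
# CLI route (kafka-topics.sh ships with Kafka):
#   kafka-topics.sh --create --topic events --partitions 3 \
#       --replication-factor 2 --bootstrap-server broker1:9092
#
# API route, using the kafka-python AdminClient (assumed installed):
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="broker1:9092")  # hypothetical broker address
topic = NewTopic(name="events", num_partitions=3, replication_factor=2)
admin.create_topics([topic])
admin.close()
```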

Asked in Accenture

Q. What happens when we enforce the schema and when we manually define the schema in the code?
Enforcing the schema ensures data consistency and validation, while manually defining the schema in code allows for more flexibility and customization.
Enforcing the schema ensures that all data conforms to a predefined structure and format, preventing errors and inconsistencies.
Manually defining the schema in code allows for more flexibility in handling different data types and structures.
Enforcing the schema can be done using tools like Apache Avro or Apache Parquet, while manually defined schemas are written directly in code, for example as a StructType in PySpark (see the sketch below).
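A short PySpark sketch of the two approaches; the file paths and column names are hypothetical.

```
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()

# Manually defined schema: strict, no inference pass over the data.
manual_schema = StructType([
    StructField("id", IntegerType(), nullable=False),
    StructField("name", StringType(), nullable=True),
])
df_manual = spark.read.schema(manual_schema).json("/data/users.json")

# Inferred schema: flexible, but Spark must scan the data to guess types.
df_inferred = spark.read.option("inferSchema", "true").csv("/data/users.csv", header=True)
```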

Asked in Datamatics Global Services

Q. There are 4 balls of each color (Red, Green, Blue) in a box. If you draw 3 balls randomly, what is the probability of all 3 balls being the same color?
The probability of drawing 3 balls of the same color from a box containing 4 balls of each color (Red, Green, Blue) is 12/220 = 3/55, roughly 5.5%.
Calculate the total number of ways to draw 3 balls out of 12 balls
Calculate the number of ways to draw 3 balls of the same color
Divide the number of favorable outcomes by the total number of outcomes to get the probability
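A quick worked check of the numbers using Python's math.comb:

```
from math import comb

total = comb(12, 3)          # 220 ways to pick any 3 of the 12 balls
favorable = 3 * comb(4, 3)   # 3 colors x 4 ways to pick 3 balls of that color = 12
print(favorable / total)     # 12/220 = 3/55 ≈ 0.0545
```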

Asked in LTIMindtree

Q. How do you do performance optimization in Spark? How did you do it in your project?
Performance optimization in Spark involves tuning configurations, optimizing code, and utilizing caching.
Tune Spark configurations such as executor memory, number of executors, and shuffle partitions.
Optimize code by reducing unnecessary shuffles, using efficient transformations, and avoiding unnecessary data movements.
Utilize caching to store intermediate results in memory and avoid recomputation.
Example: In my project, I optimized Spark performance by increasing executor memory, tuning shuffle partitions, and caching frequently reused DataFrames.
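An illustrative configuration sketch; the exact values are assumptions that would be tuned to the cluster size and data volume.

```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tuned-job")
    .config("spark.executor.memory", "8g")           # per-executor heap
    .config("spark.executor.cores", "4")             # cores per executor
    .config("spark.sql.shuffle.partitions", "200")   # parallelism after shuffles
    .getOrCreate()
)

df = spark.read.parquet("/data/events")   # hypothetical input path
df.cache()                                # keep a reused intermediate result in memory
```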

Asked in Publicis Sapient

Q. What happens if a job fails in the pipeline after the data processing cycle is complete?
If a job fails after the data processing cycle is complete, downstream data can be left incomplete or inaccurate.
Incomplete data may affect downstream processes and analysis
Data quality may be compromised if errors are not addressed
Monitoring and alerting systems should be in place to detect and handle failures
Re-running the failed job or implementing error handling mechanisms can help prevent issues in the future

Asked in HSBC Group

Q. 1. What is udf in Spark? 2. Write PySpark code to check the validity of mobile_number column
UDF stands for User-Defined Function in Spark. It allows users to define their own functions to process data.
UDFs can be written in different programming languages like Python, Scala, and Java.
UDFs can be used to perform complex operations on data that are not available in built-in functions.
PySpark code to check the validity of mobile_number column can be written using regular expressions and the `regexp_extract` function.
Example: `df.select('mobile_number', (regexp_extract('mobile_number', '^[6-9][0-9]{9}$', 0) != '').alias('is_valid'))` marks rows whose number matches a 10-digit pattern (see the fuller sketch below).
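A runnable sketch of both approaches; the "10 digits starting with 6-9" pattern is an assumption about what counts as valid here.

```
import re
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("9876543210",), ("12345",)], ["mobile_number"])

# Built-in approach: rlike with a regular expression (usually faster than a UDF).
df = df.withColumn("is_valid", F.col("mobile_number").rlike(r"^[6-9]\d{9}$"))

# UDF approach, shown because the question asks about UDFs.
@F.udf(returnType=BooleanType())
def valid_mobile(number):
    return bool(re.fullmatch(r"[6-9]\d{9}", number or ""))

df.withColumn("is_valid_udf", valid_mobile("mobile_number")).show()
```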

Asked in Accenture

Q. How many stages will be created from the code I have written?
The number of stages depends on the specific code, its complexity, and the tasks being performed.
Stages may include data extraction, transformation, loading, and processing.
It is important to analyze the code and identify distinct stages to determine the total number.
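If the question refers to Spark, a useful rule of thumb is that every shuffle (wide transformation) starts a new stage. A hypothetical example:

```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/data/orders")                 # scan + narrow transformations -> stage 1

result = (
    df.filter(F.col("amount") > 0)                      # narrow: stays in stage 1
      .groupBy("customer_id")                           # shuffle boundary -> stage 2 begins
      .agg(F.sum("amount").alias("total"))
)
result.write.parquet("/data/order_totals")              # two stages in total for this job
```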

Asked in HSBC Group

Q. Merge two unsorted lists such that the output list is sorted. You are free to use inbuilt sorting functions to sort the input lists
Merge two unsorted lists into a sorted list using inbuilt sorting functions.
Use inbuilt sorting functions to sort the input lists
Merge the sorted lists using a merge algorithm
Return the merged and sorted list
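A minimal sketch: sort each input with the built-in sorted(), then merge the two sorted sequences (heapq.merge performs the linear merge step).

```
from heapq import merge

def merge_sorted(list_a, list_b):
    # Sort each unsorted input, then merge the two sorted sequences.
    return list(merge(sorted(list_a), sorted(list_b)))

print(merge_sorted([5, 1, 9], [4, 2, 8]))  # [1, 2, 4, 5, 8, 9]
```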

Asked in Fractal Analytics

Q. What are the key features and functionalities of Snowflake?
Snowflake is a cloud-based data warehousing platform known for its scalability, performance, and ease of use.
Snowflake uses a multi-cluster, shared-data architecture that separates storage from compute, so each can scale independently for better performance.
It supports both structured and semi-structured data, allowing users to work with various data types.
Snowflake offers features like automatic scaling, data sharing, and built-in support for SQL queries.
It provides a web interface (Snowsight) for writing queries and managing warehouses, plus drivers and connectors for programmatic access.

Asked in Accenture

Q. What are internal and external tables in Hive?
Internal tables store data within Hive's warehouse directory while external tables store data outside of it.
Internal tables are managed by Hive and are deleted when the table is dropped
External tables are not managed by Hive and data is not deleted when the table is dropped
Internal (managed) tables give Hive full control over the data lifecycle, while external tables only register metadata over files that already exist
External tables are useful for sharing data between different systems
Example: CREATE TABLE my_table (col1 INT, col2 STRING); vs. CREATE EXTERNAL TABLE my_ext_table (col1 INT, col2 STRING) LOCATION '/data/my_ext_table';

Asked in Rakuten

Q. Describe a case study on how to determine the optimal location to build a store using a data science approach.
Utilize data science to analyze demographics, competition, and location factors for optimal store placement.
Analyze demographic data: Use census data to identify population density and income levels in potential areas.
Evaluate competition: Map existing stores and assess their performance to find underserved locations.
Consider foot traffic: Use mobile data or sensors to measure foot traffic in different areas at various times.
Assess accessibility: Analyze transportation networks, parking, and travel times to gauge how easily customers can reach each candidate site.

Asked in Procore

Q. What is Data Lake? Difference between data lake and data warehouse
Data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed.
Data lake stores raw, unstructured data from various sources.
Data lake allows for storing large amounts of data without the need for a predefined schema.
Data lake is cost-effective for storing data that may not have a clear use case at the time of storage.
Data warehouse stores structured data for querying and analysis.
Data warehouse requires a predefined schema before loading (schema-on-write), whereas a data lake applies the schema when the data is read (schema-on-read).

Asked in Capgemini

Q. How would you join two large tables in PySpark?
Use a broadcast join when one table is small enough to fit in executor memory, and a shuffle (sort-merge) join when both tables are large.
Broadcast join - broadcast the smaller table to all worker nodes so the large table is never shuffled.
Sort-merge join - repartition both tables on the join key so matching keys land in the same partitions, then join.
Example: df1.join(broadcast(df2), 'join_key')
Example: df1.repartition('join_key').join(df2.repartition('join_key'), 'join_key')
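A sketch of both strategies; the paths, join key, and partition count are hypothetical.

```
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()
transactions = spark.read.parquet("/data/transactions")   # large
customers = spark.read.parquet("/data/customers")         # small enough to broadcast

# Broadcast join: ships the small table to every executor, avoiding a shuffle of the large one.
joined_small = transactions.join(broadcast(customers), "customer_id")

# Sort-merge join for two large tables: repartition both sides on the join key first.
big_a = transactions.repartition(400, "customer_id")
big_b = spark.read.parquet("/data/invoices").repartition(400, "customer_id")
joined_large = big_a.join(big_b, "customer_id")
```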

Asked in Ernst & Young

Q. Write a SQL query to find the highest salary for each employee in each department, given the employees and department tables.
Find the highest salary in each department (and, if needed, the employee who earns it) using SQL.
Use the 'employees' table to get employee details including salary.
Join the 'employees' table with the 'departments' table on department ID.
Use the SQL 'GROUP BY' clause to group results by department.
Utilize the 'MAX()' function to find the highest salary within each department.
Example SQL query: SELECT department_id, MAX(salary) FROM employees GROUP BY department_id;
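To also return the employee who earns that top salary, a window function works; the sketch below runs the SQL through Spark, and the table and column names (employees, departments, employee_name) are assumptions.

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
top_paid = spark.sql("""
    SELECT department_id, employee_name, salary
    FROM (
        SELECT e.department_id, e.employee_name, e.salary,
               RANK() OVER (PARTITION BY e.department_id ORDER BY e.salary DESC) AS rnk
        FROM employees e
        JOIN departments d ON e.department_id = d.department_id
    ) ranked
    WHERE rnk = 1
""")
top_paid.show()
```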

Asked in PwC

Q. What is data flow? Difference with ADF pipeline and data flow
Data flow is a visual representation of data movement and transformation. ADF pipeline is a set of activities to move and transform data.
Data flow is a drag-and-drop interface to design data transformation logic
ADF pipeline is a set of activities to orchestrate data movement and transformation
Data flows implement the transformation logic (executed on managed Spark clusters), while pipelines orchestrate activities; they complement rather than replace each other
Data flow can be used to transform data within a pipeline or as a standalone entity

Asked in Infosys

Q. What is the difference between DBMS and RDBMS?
DBMS is a software system to manage databases while RDBMS is a type of DBMS that stores data in a structured manner.
DBMS stands for Database Management System while RDBMS stands for Relational Database Management System.
DBMS can manage any type of database while RDBMS manages only relational databases.
DBMS does not enforce any specific data model while RDBMS enforces the relational data model.
Examples of DBMS include MongoDB and Cassandra, while examples of RDBMS include MySQL, PostgreSQL, and Oracle.

Asked in Cognizant

Q. What are all the issues you faced in your project? What is a Global Parameter? Why do we need parameters in ADF? What are the APIs in Spark?
A summary covering project issues, ADF global parameters, and Spark APIs
Issues faced in project: data quality, scalability, performance, integration
Global parameter: a constant defined at the Data Factory level that can be referenced across multiple pipelines
Parameters in ADF: used to pass values between activities in a pipeline
APIs in Spark: Spark SQL, Spark Streaming, MLlib, GraphX

Asked in Advent Informatics

Q. Introduction Project flow Why did you use HBase in your project? How did you query for data in HBase? What was the purpose of Hive? What are external partitioned tables? Optimization done in your projects
Discussion on project flow, HBase, Hive, external partitioned tables, and optimization in a Data Engineer interview.
Explained project flow and the reason for using HBase in the project
Discussed querying data in HBase and the purpose of Hive
Described external partitioned tables and optimization techniques used in the project

Asked in Capgemini

Q. What Spark configuration would you use to process 2 GB of data?
Set spark configuration with appropriate memory and cores for efficient processing of 2 GB data
Increase executor memory and cores to handle larger data size
Adjust spark memory overhead to prevent out of memory errors
Optimize shuffle partitions for better performance
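An illustrative sizing sketch; 2 GB is small, so modest resources are enough, and these numbers are assumptions to adjust per cluster.

```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("two-gb-batch")
    .config("spark.executor.instances", "2")
    .config("spark.executor.cores", "2")
    .config("spark.executor.memory", "4g")
    .config("spark.sql.shuffle.partitions", "16")   # ~128 MB per partition for 2 GB of data
    .getOrCreate()
)
```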

Asked in Schneider Electric

Q. Write a Python program to convert a number to words. For example: input 123, output - One hundred twenty three.
Python program to convert a number to words.
Use a dictionary to map numbers to words.
Divide the number into groups of three digits and convert each group to words.
Handle special cases like zero, negative numbers, and numbers greater than or equal to one billion.
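A minimal sketch for non-negative integers below one million; extend the scale words (million, billion) and add sign handling for the general case.

```
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
        "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
        "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]

def two_digits(n):
    if n < 20:
        return ONES[n]
    word = TENS[n // 10]
    return word + (" " + ONES[n % 10] if n % 10 else "")

def three_digits(n):
    if n < 100:
        return two_digits(n)
    word = ONES[n // 100] + " hundred"
    return word + (" " + two_digits(n % 100) if n % 100 else "")

def number_to_words(n):
    if n == 0:
        return "zero"
    parts = []
    if n >= 1000:
        parts.append(three_digits(n // 1000) + " thousand")
        n %= 1000
    if n:
        parts.append(three_digits(n))
    return " ".join(parts)

print(number_to_words(123))  # one hundred twenty three
```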

Asked in Procore

Q. Why do we need a data warehouse? Why can't we store data in a normal transactional database?
Data warehouses are designed for analytical queries and reporting, while transactional databases are optimized for transactional processing.
Data warehouses are optimized for read-heavy workloads, allowing for complex queries and reporting.
Transactional databases are optimized for write-heavy workloads, ensuring data integrity and consistency.
Data warehouses often store historical data for analysis, while transactional databases focus on current data for operational purposes.

Asked in Skewb Analytics

Q. Can you provide an example of how to handle different categorical values based on their frequency?
Treating categorical values based on frequency typically involves grouping rare categories together.
Identify rare values based on their frequency distribution
Group rare values together to reduce complexity
Consider creating a separate category for rare values
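A small pandas sketch (pandas assumed available); the 20% threshold and column name are illustrative.

```
import pandas as pd

df = pd.DataFrame({"city": ["Pune", "Pune", "Mumbai", "Delhi", "Goa", "Pune", "Mumbai"]})

freq = df["city"].value_counts(normalize=True)
rare = freq[freq < 0.20].index                                  # categories below the threshold
df["city_grouped"] = df["city"].where(~df["city"].isin(rare), "Other")
print(df)
```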

Asked in Accenture

Q. What are the methods for migrating a Hive metastore to Unity Catalog in Databricks?
Databricks provides several migration paths: the Catalog Explorer upgrade wizard, the SYNC SQL command, and the databrickslabs UCX migration tool; databricks-cli can help script and validate the process.
The upgrade wizard in Catalog Explorer can bulk-upgrade external Hive metastore tables into a Unity Catalog schema.
The SYNC command upgrades external tables and schemas from the Hive metastore to Unity Catalog while leaving the underlying data files in place.
Managed Hive tables generally need their data copied into Unity Catalog managed storage, for example with CREATE TABLE ... AS SELECT or a deep clone.
Validate the migration by comparing tables and databases in the Unity Catalog against the original metastore.

Asked in Kellogg

Q. Explanation of the current project architecture, the cloud services used in the project and the purpose of using them, and the architecture of Spark and Hive.
Our project architecture uses Spark and Hive for data processing and storage respectively. We utilize AWS services such as S3, EMR, and Glue for scalability and cost-effectiveness.
Spark is used for distributed data processing and analysis
Hive is used for data warehousing and querying
AWS S3 is used for storing large amounts of data
AWS EMR is used for running Spark and Hive clusters
AWS Glue is used for ETL (Extract, Transform, Load) jobs
The purpose of using these services is to achieve scalability and cost-effectiveness for processing and storing large volumes of data.

Asked in KPMG India

Q. Given a dictionary, how do you find the greatest number for the same key in Python?
Find the greatest number for the same key in a Python dictionary.
Use max() function with key parameter to find the maximum value for each key in the dictionary.
Iterate through the dictionary and apply max() function on each key.
If the dictionary is nested, use recursion to iterate through all the keys.
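A sketch under two readings of the question; the input data is hypothetical.

```
# Reading 1: a stream of (key, value) pairs with repeated keys -> keep the largest value per key.
pairs = [("a", 3), ("b", 7), ("a", 9), ("b", 2)]
greatest = {}
for key, value in pairs:
    greatest[key] = max(greatest.get(key, value), value)
print(greatest)                                   # {'a': 9, 'b': 7}

# Reading 2: a dict mapping each key to a list of numbers -> max per key.
data = {"a": [3, 9], "b": [7, 2]}
print({k: max(v) for k, v in data.items()})       # {'a': 9, 'b': 7}
```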

Asked in KPMG India

Q. 1. What are columnar storage, Parquet, and Delta? Why are they used?
Columnar storage is a data storage format that stores data in columns rather than rows, improving query performance.
Columnar storage stores data in a column-wise manner instead of row-wise.
It improves query performance by reducing the amount of data that needs to be read from disk.
Parquet is a columnar storage file format that is optimized for big data workloads.
It is used in Apache Spark and other big data processing frameworks.
Delta is an open-source storage layer that provides ACID transactions, schema enforcement, and time travel on top of Parquet files.
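A minimal write example, assuming a Spark environment with the Delta Lake package on the classpath; the paths are hypothetical.

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000).withColumnRenamed("id", "order_id")

# Parquet: columnar files with good compression and column pruning.
df.write.mode("overwrite").parquet("/tmp/orders_parquet")

# Delta: Parquet files plus a transaction log, adding ACID updates and time travel.
df.write.format("delta").mode("overwrite").save("/tmp/orders_delta")
```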

Asked in IBM

Q. How do you read JSON data using Spark?
To read JSON data using Spark, use the SparkSession.read.json() method.
Create a SparkSession object
Use the read.json() method to read the JSON data
Specify the path to the JSON file or directory containing JSON files
The resulting DataFrame can be manipulated using Spark's DataFrame API
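A short sketch; the file paths and the event_type column are hypothetical.

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.json("/data/events.json")                                   # one JSON record per line
nested = spark.read.option("multiLine", "true").json("/data/nested.json")   # a single document spanning lines

df.printSchema()
df.filter(df.event_type == "click").show()
```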

Asked in Capgemini

Q. Write a SQL query to get the names of students who scored greater than 45 in each subject from the Student table.
SQL query to retrieve student names with marks > 45 in each subject
Use GROUP BY and HAVING clauses to filter students with marks > 45 in each subject
Join Student table with Marks table on student_id to get marks for each student
Select student names from Student table based on the conditions
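One way to express "greater than 45 in every subject" is HAVING MIN(marks) > 45; the table and column names below are assumptions, run here through Spark SQL.

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
result = spark.sql("""
    SELECT s.name
    FROM Student s
    JOIN Marks m ON m.student_id = s.student_id
    GROUP BY s.student_id, s.name
    HAVING MIN(m.marks) > 45   -- if the lowest mark exceeds 45, every subject does
""")
result.show()
```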

Asked in ValueMomentum

Q. How would you process 1 million records using ADF efficiently, given that your compute can only process 10,000 records at a time?
Efficiently process 1 million records in ADF by batching and parallelism.
Use Data Flow in ADF to create a pipeline that processes data in batches of 10,000 records.
Implement a ForEach activity to iterate over the batches, allowing parallel execution.
Utilize the 'Batch Size' setting in the Copy Data activity to control the number of records processed at once.
Consider partitioning the data source if applicable, to improve performance and reduce processing time.
Monitor and optimize pipeline runs, adjusting batch size and parallelism settings as needed.

Asked in Shiprocket Private Limited

Q. Design a generic tool or package using PySpark that allows creating connections to multiple databases like MySQL, S3, or APIs. Fetch the results, perform transformations like handling null values, and then store the output in another database.
Design a generic tool in PySpark to connect to multiple sources, fetch results, handle null values, and store the output in another database
Use PySpark to create a tool that can connect to sources like MySQL, S3, or an API
Implement functions to fetch data from the databases and perform transformations like handling null values
Utilize pyspark to store the transformed data in another database
Consider using pyspark SQL functions for data transformations
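A minimal sketch of such a tool; the connection options, table names, and fill values are all hypothetical, and a production version would add configuration and error handling (plus the relevant JDBC drivers on the classpath).

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def read_source(source_type, options):
    """Read a DataFrame from MySQL (JDBC), S3, or a REST API."""
    if source_type == "mysql":
        return spark.read.format("jdbc").options(**options).load()
    if source_type == "s3":
        return spark.read.parquet(options["path"])
    if source_type == "api":
        import requests
        return spark.createDataFrame(requests.get(options["url"]).json())
    raise ValueError(f"Unsupported source: {source_type}")

def transform(df, null_defaults):
    """Handle nulls by filling column-specific default values."""
    return df.fillna(null_defaults)

def write_target(df, jdbc_options):
    """Append the result to a target database over JDBC."""
    df.write.format("jdbc").options(**jdbc_options).mode("append").save()

# Usage: read from MySQL, clean, write to PostgreSQL (all connection details hypothetical).
orders = read_source("mysql", {"url": "jdbc:mysql://host/db", "dbtable": "orders",
                               "user": "user", "password": "pwd"})
write_target(transform(orders, {"amount": 0}),
             {"url": "jdbc:postgresql://host/dw", "dbtable": "orders_clean",
              "user": "user", "password": "pwd"})
```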