Data Engineer

900+ Data Engineer Interview Questions and Answers

Updated 16 Dec 2024

Q1. Optimal Strategy for a Game

You and your friend Ninjax are playing a game of coins. Ninjax places 'N' coins in a straight line.

The rule of the game is as follows:

1. Each coin has a value associated…
Ans.

The task is to find the maximum amount you can definitely win in a game of coins against an opponent who plays optimally.

  • The game is played with alternating turns, and each player can pick the first or last coin from the line.

  • The value associated with the picked coin adds up to the total amount the player wins.

  • To maximize your winnings, you need to consider all possible combinations of coin picks.

  • Use dynamic programming to calculate the maximum amount you can win.

  • Keep track o…
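
A minimal Python sketch of the interval DP described above (function and variable names are illustrative):

    def max_guaranteed_win(coins):
        # dp[i][j] = maximum amount the player to move can guarantee from coins[i..j]
        n = len(coins)
        dp = [[0] * n for _ in range(n)]
        for i in range(n):
            dp[i][i] = coins[i]
        for length in range(2, n + 1):
            for i in range(n - length + 1):
                j = i + length - 1
                # after our pick the opponent plays optimally, so we get the worse sub-game
                pick_left = coins[i] + min(dp[i + 2][j] if i + 2 <= j else 0,
                                           dp[i + 1][j - 1] if i + 1 <= j - 1 else 0)
                pick_right = coins[j] + min(dp[i + 1][j - 1] if i + 1 <= j - 1 else 0,
                                            dp[i][j - 2] if i <= j - 2 else 0)
                dp[i][j] = max(pick_left, pick_right)
        return dp[0][n - 1]

    print(max_guaranteed_win([5, 3, 7, 10]))  # 15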

Q2. Next Greater Element

You are given an array arr of length N. You have to return a list of integers containing the NGE (next greater element) of each element of the given array. The NGE for an element X is the fir…

Ans.

The task is to find the next greater element for each element in the given array.

  • Iterate through the array from right to left.

  • Use a stack to keep track of the next greater element.

  • Pop elements from the stack until a greater element is found or the stack is empty.

  • If the stack is empty, there is no greater element, so assign -1.

  • If a greater element is found, assign it as the next greater element.

  • Push the current element onto the stack.

  • Return the list of next greater elements.
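
A short Python sketch of the right-to-left stack scan outlined above (names are illustrative):

    def next_greater_elements(arr):
        result = [-1] * len(arr)
        stack = []  # candidates for "next greater", seen to the right of the current index
        for i in range(len(arr) - 1, -1, -1):
            # discard elements that are not greater than arr[i]
            while stack and stack[-1] <= arr[i]:
                stack.pop()
            if stack:
                result[i] = stack[-1]
            stack.append(arr[i])
        return result

    print(next_greater_elements([1, 3, 2, 4]))  # [3, 4, 4, -1]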

Data Engineer Interview Questions and Answers for Freshers

Q3. Search In Rotated Sorted Array

Aahad and Harshit always have fun by solving problems. Harshit took a sorted array and rotated it clockwise by an unknown amount. For example, he took a sorted array = [1, 2, 3, 4, …

Ans.

This is a problem where a sorted array is rotated and we need to search for given numbers in the array.

  • The array is rotated clockwise by an unknown amount.

  • We need to search for Q numbers in the rotated array.

  • If a number is found, we need to return its index, otherwise -1.

  • The search needs to be done in O(logN) time complexity.

  • The input consists of the size of the array, the array itself, the number of queries, and the queries.
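
One possible O(log N) binary search for a single query, assuming the array has no duplicates:

    def search_rotated(arr, target):
        lo, hi = 0, len(arr) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            if arr[mid] == target:
                return mid
            if arr[lo] <= arr[mid]:              # left half is sorted
                if arr[lo] <= target < arr[mid]:
                    hi = mid - 1
                else:
                    lo = mid + 1
            else:                                # right half is sorted
                if arr[mid] < target <= arr[hi]:
                    lo = mid + 1
                else:
                    hi = mid - 1
        return -1

    print(search_rotated([4, 5, 6, 7, 1, 2, 3], 2))  # 5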

Q4. Covid Vaccination

We are suffering from the second wave of Covid-19. The Government is trying to increase its vaccination drives. Ninja wants to help the Government plan an effective method to help increase v…

Ans.

The question asks for the maximum number of vaccines that can be administered on a specific day of a vaccination drive, given the total number of days, the total number of vaccines available, and the day number.

  • Read the number of test cases

  • For each test case, read the number of days, day number, and total number of vaccines available

  • Implement a logic to find the maximum number of vaccines administered on the given day number

  • Print the maximum number of vaccines administered for e…

Q5. K-th element of 2 sorted array

You are given two sorted arrays/lists ‘arr1’ and ‘arr2’ and an integer k. You create a new sorted array by merging all the elements from ‘arr1’ and ‘arr2’. Your task is to find the …

Ans.

The task is to find the kth smallest element of a merged array created by merging two sorted arrays.

  • Merge the two sorted arrays into a single sorted array

  • Return the kth element of the merged array
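
A simple sketch that merges first and then indexes (k assumed 1-based); a more advanced partition-based approach exists, but this matches the steps above:

    import heapq

    def kth_of_merged(arr1, arr2, k):
        merged = list(heapq.merge(arr1, arr2))  # linear-time merge of two sorted lists
        return merged[k - 1]

    print(kth_of_merged([2, 3, 45], [4, 6, 7, 8], 4))  # 6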

Q6. 1) If you are given cards numbered 1-1000 and there are 4 boxes: card no. 1 goes in box 1, card 2 in box 2, and so on; card 5 again goes in box 1. What will be the logic for this cod…
Ans.

Logic for distributing cards among 4 boxes in a circular manner.

  • Use the modulo operator to distribute cards among the boxes in a circular manner.

  • Compute box = ((card_number - 1) % 4) + 1, so cards 1-4 go to boxes 1-4 and card 5 wraps back to box 1.

  • The same formula works for any number of boxes by replacing 4 with the box count; a short sketch follows below.
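
A minimal sketch of the modulo logic (names are illustrative):

    def box_for_card(card_no, num_boxes=4):
        # cards 1..4 go to boxes 1..4, card 5 wraps back to box 1, and so on
        return (card_no - 1) % num_boxes + 1

    print([box_for_card(c) for c in range(1, 7)])  # [1, 2, 3, 4, 1, 2]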

Q7. Zigzag Binary Tree Traversal

Given a binary tree, return the zigzag level order traversal of the nodes' values of the given tree. Zigzag traversal means starting from left to right, then right to left for the ne…

Ans.

The zigzag level order traversal of a binary tree is the traversal of its nodes' values in an alternate left to right and right to left manner.

  • Perform a level order traversal of the binary tree

  • Use a queue to store the nodes at each level

  • For each level, alternate the direction of traversal

  • Store the values of the nodes in each level in separate arrays

  • Combine the arrays in alternate order to get the zigzag level order traversal
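
A possible level-order implementation with alternating direction, assuming a simple TreeNode class:

    from collections import deque

    class TreeNode:
        def __init__(self, val, left=None, right=None):
            self.val, self.left, self.right = val, left, right

    def zigzag_level_order(root):
        if root is None:
            return []
        result, queue, left_to_right = [], deque([root]), True
        while queue:
            level = []
            for _ in range(len(queue)):        # process exactly one level
                node = queue.popleft()
                level.append(node.val)
                if node.left:
                    queue.append(node.left)
                if node.right:
                    queue.append(node.right)
            result.append(level if left_to_right else level[::-1])
            left_to_right = not left_to_right
        return result

    root = TreeNode(1, TreeNode(2), TreeNode(3))
    print(zigzag_level_order(root))  # [[1], [3, 2]]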

Q8. Write code for printing duplicate numbers in a list.

Ans.

Code to print duplicate numbers in a list.

  • Iterate through the list and keep track of the count of each number using a dictionary.

  • Print the numbers that have a count greater than 1.
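
A minimal sketch using a Counter (equivalent to the dictionary approach above):

    from collections import Counter

    def print_duplicates(nums):
        counts = Counter(nums)            # number -> occurrence count
        for value, count in counts.items():
            if count > 1:
                print(value)

    print_duplicates([1, 2, 3, 2, 4, 1, 5])  # prints 1 and 2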

Q9. Next Greater Number

You are given a string S which represents a number. You have to find the smallest number strictly greater than the given number which contains the same set of digits as the original number…

Ans.

The task is to find the smallest number greater than the given number, with the same set of digits.

  • Iterate through the digits of the given number from right to left.

  • Find the first digit that is smaller than the digit to its right.

  • Swap this digit with the smallest digit to its right that is greater than it.

  • Sort the digits to the right of the swapped digit in ascending order.

  • If no such digit is found, return -1.
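
A Python sketch of the "next permutation" steps listed above, operating on the number as a string:

    def next_greater_number(s):
        digits = list(s)
        n = len(digits)
        # 1. first digit (from the right) smaller than its right neighbour
        i = n - 2
        while i >= 0 and digits[i] >= digits[i + 1]:
            i -= 1
        if i < 0:
            return "-1"                     # digits already in descending order
        # 2. swap it with the smallest digit to its right that is still greater
        j = n - 1
        while digits[j] <= digits[i]:
            j -= 1
        digits[i], digits[j] = digits[j], digits[i]
        # 3. the suffix is in descending order, so reversing sorts it ascending
        digits[i + 1:] = reversed(digits[i + 1:])
        return "".join(digits)

    print(next_greater_number("218765"))  # 251678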

Q10. How do you handle a changing schema from the source? What are the common issues faced in Hadoop and how did you resolve them?

Ans.

Handling changing schema from source in Hadoop

  • Use schema evolution techniques like Avro or Parquet to handle schema changes

  • Implement a flexible ETL pipeline that can handle schema changes

  • Use tools like Apache NiFi to dynamically adjust schema during ingestion

  • Common issues include data loss, data corruption, and performance degradation

  • Resolve issues by implementing proper testing, monitoring, and backup strategies
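
A small illustration of schema evolution with Parquet in PySpark; the path and session name are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("schema-evolution").getOrCreate()

    # Parquet stores the schema with the data, so files written with an older
    # schema can be read together with newer ones by merging their schemas.
    df = (spark.read
          .option("mergeSchema", "true")   # reconcile columns across files
          .parquet("/data/events/"))       # hypothetical path
    df.printSchema()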

Q11. 1. Design and code a scheduler for allocating meeting rooms for a given input of room count and timestamps. Input: No. of rooms: 2; Time and duration: 12pm, 30 min. Output: yes. Every time the code runs it shou…
Ans.

Design and code a scheduler for allocating meeting rooms based on input of room counts and timestamps.

  • Create a table with columns for room number, start time, and end time

  • Use SQL queries to check for available slots and allocate rooms

  • Consider edge cases such as overlapping meetings and room availability

  • Use a loop to continuously check for available slots and allocate rooms

  • Implement error handling for invalid input
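
An in-memory sketch of the allocation check (the SQL-table version described above would apply the same overlap test); names and slot values are illustrative:

    from datetime import datetime, timedelta

    def can_allocate(bookings, num_rooms, start, duration_min):
        """Book the first free room for the slot; return True if one was found."""
        end = start + timedelta(minutes=duration_min)
        for room in range(num_rooms):
            overlaps = any(s < end and start < e for s, e in bookings[room])
            if not overlaps:
                bookings[room].append((start, end))
                return True
        return False

    bookings = {room: [] for room in range(2)}      # 2 rooms
    slot = datetime(2024, 1, 1, 12, 0)              # 12 pm
    print(can_allocate(bookings, 2, slot, 30))      # True
    print(can_allocate(bookings, 2, slot, 30))      # True (second room)
    print(can_allocate(bookings, 2, slot, 30))      # False, both rooms busy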

Q12. DBMS Question

Asked several database-related questions, specifically SQL-related, such as how joins are used.

Q13. Python - sum of digits till the result becomes a single digit. Ex: 479 → 20 → 2

Ans.

Python program to find the sum of digits till the result becomes a single digit.

  • Convert the number to a string and iterate through each digit.

  • Add the digits and store the result.

  • Repeat the process until the result becomes a single digit.

  • Return the single digit result.
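
A minimal Python sketch of the repeated digit sum:

    def digital_root(n):
        n = abs(n)
        while n >= 10:
            n = sum(int(d) for d in str(n))
        return n

    print(digital_root(479))  # 479 -> 20 -> 2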

Q14. Assume we have a pan-India retail store, so in the backend there are two customer tables: a customer profile table and a customer transaction table, both linked by customer id. So what will the …
Ans.

Use a left join as the computationally efficient way to find customer names from the customer profile and transaction tables; a query sketch follows the points below.

  • Use left join to combine customer profile and transaction tables based on customer id

  • Left join will include all customers from profile table even if they don't have transactions

  • Subquery may be less efficient as it has to be executed for each row in the result set
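
A sketch of the left join via Spark SQL; the table and column names (customer_profile, customer_transaction, customer_id, etc.) are assumed, not taken from the question:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("customer-join").getOrCreate()

    # assumes both tables are registered as temp views or catalog tables
    result = spark.sql("""
        SELECT p.customer_id,
               p.customer_name,
               t.txn_id,
               t.txn_amount
        FROM customer_profile p
        LEFT JOIN customer_transaction t
          ON p.customer_id = t.customer_id
    """)
    result.show()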

Q15. SQL question to get the names of employees whose salary is greater than the average salary of their department

Ans.

SQL query to retrieve the names of employees whose salary is greater than their department's average salary; a query sketch follows the steps below.

  • Calculate average salary of department using GROUP BY clause

  • Join employee and department tables using department ID

  • Filter employees with salary greater than department average

  • Select employee name
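
One way to express it, shown through Spark SQL to stay consistent with the other examples; the table and column names (employees, emp_name, dept_id, salary) are assumed:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("salary-vs-dept-avg").getOrCreate()

    # assumes an `employees` table/temp view with emp_name, dept_id and salary columns
    above_avg = spark.sql("""
        SELECT e.emp_name
        FROM employees e
        JOIN (SELECT dept_id, AVG(salary) AS avg_sal
              FROM employees
              GROUP BY dept_id) d
          ON e.dept_id = d.dept_id
        WHERE e.salary > d.avg_sal
    """)
    above_avg.show()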

Q16. SQL - List employees whose age is greater than the average age of all employees

Ans.

List employees whose age is greater than average age of all employees using SQL.

  • Calculate the average age of all employees using AVG() function.

  • Use a WHERE clause to keep only employees whose age is greater than the average age.

  • A join with a separate age table is needed only if age is not stored on the employees table itself.

  • Example: SELECT * FROM employees WHERE age > (SELECT AVG(age) FROM employees);

Q17. 1) How to handle data skewness in Spark?

Ans.

Data skewness in Spark can be handled by partitioning, bucketing, or using salting techniques.

  • Partitioning the data based on a key column can distribute the data evenly across the nodes.

  • Bucketing can group the data into buckets based on a key column, which can improve join performance.

  • Salting involves adding a random prefix to the key column, which can distribute the data evenly.

  • Using broadcast joins for small tables can also help in reducing skewness.

  • Using dynamic allocation…
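
A rough sketch of the salting idea for a skewed join; the data, bucket count and names below are made up for illustration:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("salting-demo").getOrCreate()

    # toy frames; in practice `big` has a heavily skewed join key
    big = spark.createDataFrame([("hot", i) for i in range(1000)], ["join_key", "value"])
    small = spark.createDataFrame([("hot", "dim-row")], ["join_key", "attr"])

    SALT_BUCKETS = 8  # assumption: 8 salt values spread the hot key enough
    big_salted = big.withColumn("salt", F.floor(F.rand() * SALT_BUCKETS))
    small_salted = small.crossJoin(
        spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt"))

    # the effective join key is now (join_key, salt), so the hot key is split across partitions
    joined = big_salted.join(small_salted, on=["join_key", "salt"])
    print(joined.count())  # 1000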

Q18. Write code to print reverse of a sentence word by word.

Ans.

Code to print reverse of a sentence word by word.

  • Split the sentence into words using space as delimiter

  • Store the words in an array

  • Print the words in reverse order
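
A one-line Python version of the same steps:

    def reverse_words(sentence):
        # split on whitespace, reverse the word order, join back with single spaces
        return " ".join(reversed(sentence.split()))

    print(reverse_words("data engineers love spark"))  # "spark love engineers data"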

Q19. What optimisations are possible to reduce the overhead of reading large datasets in Spark?

Ans.

Optimizations like partitioning, caching, and using efficient file formats can reduce overhead in reading large datasets in Spark.

  • Partitioning data based on key can reduce the amount of data shuffled during joins and aggregations

  • Caching frequently accessed datasets in memory can avoid recomputation

  • Using efficient file formats like Parquet or ORC can reduce disk I/O and improve read performance

Q20. Write Pyspark code to read csv file and show top 10 records.

Ans.

Pyspark code to read csv file and show top 10 records.

  • Import the necessary libraries

  • Create a SparkSession

  • Read the CSV file using the SparkSession

  • Display the top 10 records using the show() method
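
A minimal PySpark sketch; the file path is a placeholder:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-csv").getOrCreate()

    df = spark.read.csv("/path/to/input.csv", header=True, inferSchema=True)
    df.show(10)   # prints the top 10 records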

Q21. Write an SQL query to find the name of the person who logged in last within each country from the Person table.

Ans.

SQL query to find the name of person who logged in last within each country from Person Table

  • Use a subquery to find the max login time for each country

  • Join the Person table with the subquery on country and login time to get the name of the person
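
A possible query following the subquery-plus-join plan above; the column names (name, country, login_time) are assumed:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("last-login").getOrCreate()

    # assumes a `person` table/temp view is available in the Spark catalog
    last_login = spark.sql("""
        SELECT p.name, p.country, p.login_time
        FROM person p
        JOIN (SELECT country, MAX(login_time) AS last_login
              FROM person
              GROUP BY country) m
          ON p.country = m.country AND p.login_time = m.last_login
    """)
    last_login.show()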

Q22. Design Question

Create an API so that a message sent to the provided URL is received at the other end.

Q23. How to import data from an RDBMS via Sqoop without a primary key

Ans.

Use the --split-by option in Sqoop to import data from an RDBMS without a primary key

  • Use --split-by option to specify a column to split the import into multiple mappers

  • Use --boundary-query option to specify a query to determine the range of values for --split-by column

  • Example: sqoop import --connect jdbc:mysql://localhost/mydb --username root --password password --table mytable --split-by id

  • Example: sqoop import --connect jdbc:mysql://localhost/mydb --username root --password password …

Q24. In a word count spark program which command will run on driver and which will run on executor

Ans.

Commands that run on driver and executor in a word count Spark program.

  • Creating the SparkSession/SparkContext and defining the RDD or DataFrame lineage happens on the driver.

  • The transformations that split lines and count words run as tasks on the executors.

  • An action such as collect() brings the aggregated word counts back to the driver, while a distributed write such as saveAsTextFile() runs on the executors.

  • Driver sends tasks to executors and coordinates the overall job.

  • Executor processes the tasks assigned by the driver.

Q25. What are the optimization techniques applied in pyspark code?

Ans.

Optimization techniques in PySpark code include partitioning, caching, and using broadcast variables.

  • Partitioning data based on key columns to optimize join operations

  • Caching frequently accessed data in memory to avoid recomputation

  • Using broadcast variables to efficiently share small data across nodes

  • Using appropriate data types and avoiding unnecessary type conversions

  • Avoiding shuffling of data by using appropriate transformations and actions

  • Using appropriate data structures…

Q26. Difference between Coalesce and Repartition and In which case we are using it ?

Ans.

Coalesce is used to combine multiple small partitions into a larger one, while Repartition is used to increase or decrease the number of partitions in a DataFrame.

  • Coalesce reduces the number of partitions in a DataFrame by combining small partitions into larger ones.

  • Repartition increases or decreases the number of partitions in a DataFrame by shuffling the data across partitions.

  • Coalesce is more efficient than Repartition as it minimizes data movement.

  • Coalesce is typically us…

Q27. Write pyspark code to change column name, divide one column by another column.

Ans.

Pyspark code to change column name and divide one column by another column.

  • Use 'withColumnRenamed' method to change column name

  • Use 'withColumn' method to divide one column by another column

  • Example: df = df.withColumnRenamed('old_col_name', 'new_col_name')

  • Example: df = df.withColumn('ratio', df['col1'] / df['col2'])

Q28. 1. Difference between shallow copy and deep copy 2. How can you merge two dataframes with different column names 3. Regex question to find all the characters and numbers of a particular length 4. Spark funda…
Ans.

1. Shallow copy creates a new object but does not duplicate nested objects. Deep copy creates a new object and duplicates all nested objects. 2. Merging dataframes with different column names requires renaming columns. 3. Regex can be used to find characters and numbers of a specific length. 4. Spark fundamentals involve understanding distributed computing and data processing.

  • Shallow copy: new object with same references to nested objects. Deep copy: new object with duplicate…
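
A small illustration of point 1 (shallow vs deep copy) in Python:

    import copy

    original = [[1, 2], [3, 4]]
    shallow = copy.copy(original)      # new outer list, same inner lists
    deep = copy.deepcopy(original)     # everything duplicated

    original[0].append(99)
    print(shallow[0])  # [1, 2, 99]  -- nested object is shared
    print(deep[0])     # [1, 2]      -- fully independent copy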

Q29. Difference between Rank , Dense Rank and Row Number and when we are using each of them ?

Ans.

RANK assigns the same rank to tied rows and leaves gaps, DENSE_RANK assigns the same rank to tied rows without gaps, and ROW_NUMBER assigns a distinct sequential number to every row.

  • Rank assigns the same rank to rows with the same value, leaving gaps in the ranking if there are ties.

  • Dense Rank assigns a unique rank to each distinct row, leaving no gaps in the ranking.

  • Row Number assigns a unique number to each row, without any regard for the values in the rows.

  • Rank is used when you want to see the ranking of eac…

Q30. What happens when we enforce the schema and when we manually define the schema in the code ?

Ans.

Enforcing the schema ensures data consistency and validation, while manually defining the schema in code allows for more flexibility and customization.

  • Enforcing the schema ensures that all data conforms to a predefined structure and format, preventing errors and inconsistencies.

  • Manually defining the schema in code allows for more flexibility in handling different data types and structures.

  • Enforcing the schema can be done using tools like Apache Avro or Apache Parquet, while m…

Q31. Difference between cache and persist, repartition and coalesce.

Ans.

Cache and persist are used to store data in memory. Repartition and coalesce are used to change the number of partitions.

  • Cache stores the data in memory for faster access while persist allows the user to choose the storage level.

  • Repartition increases the number of partitions while coalesce decreases the number of partitions.

  • Cache and persist are lazy and only take effect when an action runs; repartition and coalesce are transformations, not actions.

  • Cache and persist are used for iterative algorithms while repartition and…

Q32. 5) How to create a kafka topic with replication factor 2

Ans.

To create a Kafka topic with replication factor 2, use the command line tool or Kafka API.

  • Use the command line tool 'kafka-topics.sh' with the '--replication-factor' flag set to 2.

  • Alternatively, use the Kafka API to create a topic with a replication factor of 2.

  • Ensure that the number of brokers in the Kafka cluster is greater than or equal to the replication factor.

  • Consider setting the 'min.insync.replicas' configuration property to 2 to ensure that at least two replicas are …

Q33. If you want very low latency, which is better: standalone or client mode?

Ans.

Client mode is better for very less latency due to direct communication with the cluster.

  • Client mode allows direct communication with the cluster, reducing latency.

  • Standalone mode requires an additional layer of communication, increasing latency.

  • Client mode is preferred for real-time applications where low latency is crucial.

Q34. How to add a column in dataframe ? How to rename the column in dataframe ?

Ans.

To add a column in a dataframe, use the 'withColumn' method. To rename a column, use the 'withColumnRenamed' method.

  • To add a column, use the 'withColumn' method with the new column name and the expression to compute the values for that column.

  • Example: df.withColumn('new_column', df['existing_column'] * 2)

  • To rename a column, use the 'withColumnRenamed' method with the current column name and the new column name.

  • Example: df.withColumnRenamed('old_column', 'new_column')

Q35. When a spark job is submitted, what happens at backend. Explain the flow.

Ans.

When a spark job is submitted, various steps are executed at the backend to process the job.

  • The job is submitted to the Spark driver program.

  • The driver program communicates with the cluster manager to request resources.

  • The cluster manager allocates resources (CPU, memory) to the job.

  • The driver program creates DAG (Directed Acyclic Graph) of the job stages and tasks.

  • Tasks are then scheduled and executed on worker nodes in the cluster.

  • Intermediate results are stored in memory o…

Q36. How many stages will be created from the above code that I have written?

Ans.

The number of stages created depends on how many shuffle boundaries the code contains.

  • In Spark, a new stage begins at every shuffle boundary, i.e. after wide transformations such as reduceByKey, groupBy, join or repartition.

  • Narrow transformations such as map and filter are pipelined into the same stage.

  • Analyze the code for wide transformations (or inspect the DAG in the Spark UI) to determine the total number of stages.

Q37. 1. What is udf in Spark? 2. Write PySpark code to check the validity of mobile_number column

Ans.

UDF stands for User-Defined Function in Spark. It allows users to define their own functions to process data.

  • UDFs can be written in different programming languages like Python, Scala, and Java.

  • UDFs can be used to perform complex operations on data that are not available in built-in functions.

  • PySpark code to check the validity of mobile_number column can be written using regular expressions and the `regexp_extract` function.

  • Example: `df.select('mobile_number', regexp_extract('…
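
A small PySpark sketch of the validity check, assuming a valid number is exactly 10 digits (the actual rule may differ):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("mobile-check").getOrCreate()

    df = spark.createDataFrame([("9876543210",), ("12345",)], ["mobile_number"])
    df = df.withColumn("is_valid", F.col("mobile_number").rlike(r"^[0-9]{10}$"))
    df.show()  # 9876543210 -> true, 12345 -> false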

Q38. Elaboration of Spark optimization techniques. Types of transformations, shuffling.

Ans.

Spark optimization techniques include partitioning, caching, and using appropriate transformations.

  • Partitioning data can improve performance by reducing shuffling.

  • Caching frequently used data can reduce the need for recomputation.

  • Transformations like filter, map, and reduceByKey can be used to optimize data processing.

  • Shuffling can be minimized by using operations like reduceByKey instead of groupByKey.

  • Broadcasting small data can improve performance by reducing network traffic…

Q39. what is an internal and external table in Hive

Ans.

Internal tables store data within Hive's warehouse directory while external tables store data outside of it.

  • Internal tables are managed by Hive and are deleted when the table is dropped

  • External tables are not managed by Hive and data is not deleted when the table is dropped

  • Internal tables are faster for querying as data is stored within Hive's warehouse directory

  • External tables are useful for sharing data between different systems

  • Example: CREATE TABLE my_table (col1 INT, col2…

Q40. How do you do performance optimization in Spark? Tell how you did it in your project.

Ans.

Performance optimization in Spark involves tuning configurations, optimizing code, and utilizing caching.

  • Tune Spark configurations such as executor memory, number of executors, and shuffle partitions.

  • Optimize code by reducing unnecessary shuffles, using efficient transformations, and avoiding unnecessary data movements.

  • Utilize caching to store intermediate results in memory and avoid recomputation.

  • Example: In my project, I optimized Spark performance by increasing executor me…

Q41. What is the difference between DBMS and RDBMS?

Ans.

DBMS is a software system to manage databases while RDBMS is a type of DBMS that stores data in a structured manner.

  • DBMS stands for Database Management System while RDBMS stands for Relational Database Management System.

  • DBMS can manage any type of database while RDBMS manages only relational databases.

  • DBMS does not enforce any specific data model while RDBMS enforces the relational data model.

  • Examples of DBMS include MongoDB and Cassandra while examples of RDBMS include MySQL…

Q42. What is a data flow? Difference between an ADF pipeline and a data flow

Ans.

Data flow is a visual representation of data movement and transformation. ADF pipeline is a set of activities to move and transform data.

  • Data flow is a drag-and-drop interface to design data transformation logic

  • ADF pipeline is a set of activities to orchestrate data movement and transformation

  • A data flow focuses on data transformation, while a pipeline focuses on orchestration (scheduling, dependencies and control flow).

  • A data flow runs as an activity inside an ADF pipeline, executed on Spark clusters managed by ADF.

Q43. What is Data Lake? Difference between data lake and data warehouse

Ans.

Data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed.

  • Data lake stores raw, unstructured data from various sources.

  • Data lake allows for storing large amounts of data without the need for a predefined schema.

  • Data lake is cost-effective for storing data that may not have a clear use case at the time of storage.

  • Data warehouse stores structured data for querying and analysis.

  • Data warehouse requires a predefined schema for d…

Q44. What will happen if job has failed in pipeline and data processing cycle is over?

Ans.

If a job fails in the pipeline and data processing cycle is over, it can lead to incomplete or inaccurate data.

  • Incomplete data may affect downstream processes and analysis

  • Data quality may be compromised if errors are not addressed

  • Monitoring and alerting systems should be in place to detect and handle failures

  • Re-running the failed job or implementing error handling mechanisms can help prevent issues in the future

Q45. Introduction, project flow. Why did you use HBase in your project? How did you query data in HBase? What was the purpose of Hive? What are external partitioned tables? Optimization done in your projects.

Ans.

Discussion on project flow, HBase, Hive, external partitioned tables, and optimization in a Data Engineer interview.

  • Explained project flow and the reason for using HBase in the project

  • Discussed querying data in HBase and the purpose of Hive

  • Described external partitioned tables and optimization techniques used in the project

Q46. Merge two unsorted lists such that the output list is sorted. You are free to use inbuilt sorting functions to sort the input lists

Ans.

Merge two unsorted lists into a sorted list using inbuilt sorting functions.

  • Use inbuilt sorting functions to sort the input lists

  • Merge the sorted lists using a merge algorithm

  • Return the merged and sorted list
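
The shortest Python version, since inbuilt sorting is allowed:

    def merge_to_sorted(list_a, list_b):
        return sorted(list_a + list_b)   # concatenate, then use the built-in sort

    print(merge_to_sorted([4, 1, 9], [7, 2]))  # [1, 2, 4, 7, 9]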

Q47. Write a python program to convert a number to words. For ex: i/p 123, o/p - One hundred twenty three

Ans.

Python program to convert a number to words.

  • Use a dictionary to map numbers to words.

  • Divide the number into groups of three digits and convert each group to words.

  • Handle special cases like zero, negative numbers, and numbers greater than or equal to one billion.
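
A possible sketch that handles numbers up to 999,999 (larger numbers and negatives would need the extra cases the answer mentions):

    ONES = ["", "one", "two", "three", "four", "five", "six", "seven", "eight",
            "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
            "sixteen", "seventeen", "eighteen", "nineteen"]
    TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
            "eighty", "ninety"]

    def three_digits(n):
        words = []
        if n >= 100:
            words.append(ONES[n // 100] + " hundred")
            n %= 100
        if n >= 20:
            words.append(TENS[n // 10])
            n %= 10
        if n:
            words.append(ONES[n])
        return " ".join(words)

    def number_to_words(n):
        if n == 0:
            return "Zero"
        thousands, rest = divmod(n, 1000)
        parts = []
        if thousands:
            parts.append(three_digits(thousands) + " thousand")
        if rest:
            parts.append(three_digits(rest))
        return " ".join(parts).capitalize()

    print(number_to_words(123))  # One hundred twenty three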

Q48. Give an example of how to treat different categorical values based on their frequency.

Ans.

Treating categorised values based on frequency involves grouping rare values together.

  • Identify rare values based on their frequency distribution

  • Group rare values together to reduce complexity

  • Consider creating a separate category for rare values
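
A pandas illustration of grouping rare categories; the 20% threshold is an arbitrary assumption:

    import pandas as pd

    s = pd.Series(["a", "a", "a", "b", "b", "c", "d"])
    freq = s.value_counts(normalize=True)        # relative frequency of each category
    rare = freq[freq < 0.20].index               # categories below the threshold
    cleaned = s.where(~s.isin(rare), other="other")
    print(cleaned.tolist())  # ['a', 'a', 'a', 'b', 'b', 'other', 'other']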

Q49. Explanation of the current project architecture, cloud services used in the project and the purpose of using them. Architecture of Spark, Hive.

Ans.

Our project architecture uses Spark and Hive for data processing and storage respectively. We utilize AWS services such as S3, EMR, and Glue for scalability and cost-effectiveness.

  • Spark is used for distributed data processing and analysis

  • Hive is used for data warehousing and querying

  • AWS S3 is used for storing large amounts of data

  • AWS EMR is used for running Spark and Hive clusters

  • AWS Glue is used for ETL (Extract, Transform, Load) jobs

  • The purpose of using these services is to…

Q50. Why do we need a data warehouse, why can't we store in the normal transactional database.

Ans.

Data warehouses are designed for analytical queries and reporting, while transactional databases are optimized for transactional processing.

  • Data warehouses are optimized for read-heavy workloads, allowing for complex queries and reporting.

  • Transactional databases are optimized for write-heavy workloads, ensuring data integrity and consistency.

  • Data warehouses often store historical data for analysis, while transactional databases focus on current data for operational purposes.

  • D…
