900+ Data Engineer Interview Questions and Answers
You and your friend Ninjax are playing a game of coins. Ninjax places 'N' coins in a straight line.
The rule of the game is as follows:
1. Each coin has a value associate...read more
The task is to find the maximum amount you can definitely win in a game of coins against an opponent who plays optimally.
The game is played with alternating turns, and each player can pick the first or last coin from the line.
The value associated with the picked coin adds up to the total amount the player wins.
To maximize your winnings, you need to consider all possible combinations of coin picks.
Use dynamic programming to calculate the maximum amount you can win.
Keep track o...read more
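A minimal Python sketch of the dynamic-programming idea described above (function and variable names are illustrative, not taken from the original problem statement):

```python
def max_winnings(coins):
    """Maximum amount the first player can guarantee when both play optimally."""
    n = len(coins)
    prefix = [0]
    for c in coins:
        prefix.append(prefix[-1] + c)
    # dp[i][j] = best total the player to move can collect from coins[i..j]
    dp = [[0] * n for _ in range(n)]
    for i in range(n - 1, -1, -1):
        for j in range(i, n):
            total = prefix[j + 1] - prefix[i]
            if i == j:
                dp[i][j] = coins[i]
            else:
                # after our pick, the opponent plays optimally on the remaining range
                dp[i][j] = total - min(dp[i + 1][j], dp[i][j - 1])
    return dp[0][n - 1]

print(max_winnings([5, 3, 7, 10]))  # 15 (pick 10, later pick 5)
```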
You are given an array arr of length N. You have to return a list of integers containing the NGE(next greater element) of each element of the given array. The NGE for an element X is the fir...read more
The task is to find the next greater element for each element in the given array.
Iterate through the array from right to left.
Use a stack to keep track of the next greater element.
Pop elements from the stack until a greater element is found or the stack is empty.
If the stack is empty, there is no greater element, so assign -1.
If a greater element is found, assign it as the next greater element.
Push the current element onto the stack.
Return the list of next greater elements.
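A short sketch of the stack-based approach in Python (names are illustrative):

```python
def next_greater_elements(arr):
    """Next greater element to the right of each element; -1 if none exists."""
    result = [-1] * len(arr)
    stack = []  # candidates for "next greater", maintained while scanning right to left
    for i in range(len(arr) - 1, -1, -1):
        # discard elements that are not greater than the current one
        while stack and stack[-1] <= arr[i]:
            stack.pop()
        if stack:
            result[i] = stack[-1]
        stack.append(arr[i])
    return result

print(next_greater_elements([1, 3, 2, 4]))  # [3, 4, 4, -1]
```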
Data Engineer Interview Questions and Answers for Freshers
Aahad and Harshit always have fun by solving problems. Harshit took a sorted array and rotated it clockwise by an unknown amount. For example, he took a sorted array = [1, 2, 3, 4,...read more
This is a problem where a sorted array is rotated and we need to search for given numbers in the array.
The array is rotated clockwise by an unknown amount.
We need to search for Q numbers in the rotated array.
If a number is found, we need to return its index, otherwise -1.
The search needs to be done in O(logN) time complexity.
The input consists of the size of the array, the array itself, the number of queries, and the queries.
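A hedged sketch of the O(logN) search, assuming the array elements are distinct:

```python
def search_rotated(arr, target):
    """Binary search in a sorted array rotated by an unknown amount."""
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid
        if arr[lo] <= arr[mid]:              # left half is sorted
            if arr[lo] <= target < arr[mid]:
                hi = mid - 1
            else:
                lo = mid + 1
        else:                                 # right half is sorted
            if arr[mid] < target <= arr[hi]:
                lo = mid + 1
            else:
                hi = mid - 1
    return -1

rotated = [4, 5, 1, 2, 3]
print([search_rotated(rotated, q) for q in (1, 5, 6)])  # [2, 1, -1]
```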
We are suffering from the Second wave of Covid-19. The Government is trying to increase its vaccination drives. Ninja wants to help the Government to plan an effective method to help increase v...read more
This question asks for finding the maximum number of vaccines administered on a specific day during a vaccination drive, given the total number of days, total number of vaccines available, and the day number.
Read the number of test cases
For each test case, read the number of days, day number, and total number of vaccines available
Implement a logic to find the maximum number of vaccines administered on the given day number
Print the maximum number of vaccines administered for e...read more
You are given two sorted arrays/list ‘arr1’ and ‘arr2’ and an integer k. You create a new sorted array by merging all the elements from ‘arr1’ and ‘arr2’. Your task is to find the ...read more
The task is to find the kth smallest element of a merged array created by merging two sorted arrays.
Merge the two sorted arrays into a single sorted array
Return the kth element of the merged array
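A minimal sketch, assuming k is 1-indexed:

```python
import heapq

def kth_smallest(arr1, arr2, k):
    """kth smallest element after merging two already-sorted arrays."""
    merged = list(heapq.merge(arr1, arr2))  # linear merge of the sorted inputs
    return merged[k - 1]

print(kth_smallest([2, 4, 6], [1, 3, 5], 4))  # 4
```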
Q6. 1) If you are given cards numbered 1-1000 and there are 4 boxes. Card no. 1 will go in box 1, card 2 in box 2, and so on. Card 5 will again go in box 1. So what will be the logic for this cod...
Logic for distributing cards among 4 boxes in a circular manner.
Use modulo operator to distribute cards among boxes in a circular manner.
Use the modulo operator to map each card to a box in a circular manner: box = ((card_number - 1) % 4) + 1.
Cards 1-4 go to boxes 1-4, card 5 wraps back to box 1, card 6 to box 2, and so on.
A plain divisibility check (e.g. "divisible by 4 goes to box 4") does not work: card 6 is divisible by 2 and 3, yet it belongs in box 2.
A sketch follows below.
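A one-line sketch of the circular assignment:

```python
def box_for_card(card_number, num_boxes=4):
    """Circular assignment: cards 1..4 -> boxes 1..4, card 5 -> box 1, and so on."""
    return (card_number - 1) % num_boxes + 1

print([box_for_card(c) for c in range(1, 9)])  # [1, 2, 3, 4, 1, 2, 3, 4]
```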
Given a binary tree, return the zigzag level order traversal of the nodes' values of the given tree. Zigzag traversal means starting from left to right, then right to left for the ne...read more
The zigzag level order traversal of a binary tree is the traversal of its nodes' values in an alternate left to right and right to left manner.
Perform a level order traversal of the binary tree
Use a queue to store the nodes at each level
For each level, alternate the direction of traversal
Store the values of the nodes in each level in separate arrays
Combine the arrays in alternate order to get the zigzag level order traversal
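A sketch of the traversal in Python (the Node class here is a stand-in, not part of the original question):

```python
from collections import deque

class Node:
    def __init__(self, val, left=None, right=None):
        self.val, self.left, self.right = val, left, right

def zigzag_level_order(root):
    """Level order traversal, reversing direction on alternate levels."""
    if not root:
        return []
    result, queue, left_to_right = [], deque([root]), True
    while queue:
        level = [queue.popleft() for _ in range(len(queue))]
        values = [node.val for node in level]
        result.append(values if left_to_right else values[::-1])
        left_to_right = not left_to_right
        for node in level:
            if node.left:
                queue.append(node.left)
            if node.right:
                queue.append(node.right)
    return result

tree = Node(1, Node(2, Node(4), Node(5)), Node(3, Node(6), Node(7)))
print(zigzag_level_order(tree))  # [[1], [3, 2], [4, 5, 6, 7]]
```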
Q8. Write code for printing duplicate numbers in a list.
Code to print duplicate numbers in a list.
Iterate through the list and keep track of the count of each number using a dictionary.
Print the numbers that have a count greater than 1.
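A minimal sketch:

```python
from collections import Counter

def print_duplicates(numbers):
    """Print every number that appears more than once in the list."""
    counts = Counter(numbers)
    for value, count in counts.items():
        if count > 1:
            print(value)

print_duplicates([1, 2, 3, 2, 4, 1, 5])  # prints 1 and 2
```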
You are given a string S which represents a number. You have to find the smallest number strictly greater than the given number which contains the same set of digits as of the original number...read more
The task is to find the smallest number greater than the given number, with the same set of digits.
Iterate through the digits of the given number from right to left.
Find the first digit that is smaller than the digit to its right.
Swap this digit with the smallest digit to its right that is greater than it.
Sort the digits to the right of the swapped digit in ascending order.
If no such digit is found, return -1.
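A sketch of the steps above (essentially the next-permutation algorithm), taking the number as a string:

```python
def next_greater_number(s):
    """Smallest number strictly greater than s using the same digits; "-1" if none."""
    digits = list(s)
    # find the rightmost digit smaller than the digit to its right
    i = len(digits) - 2
    while i >= 0 and digits[i] >= digits[i + 1]:
        i -= 1
    if i < 0:
        return "-1"                      # digits are in descending order
    # swap it with the smallest digit to its right that is still greater
    j = len(digits) - 1
    while digits[j] <= digits[i]:
        j -= 1
    digits[i], digits[j] = digits[j], digits[i]
    # reverse the suffix so it becomes as small as possible
    digits[i + 1:] = reversed(digits[i + 1:])
    return "".join(digits)

print(next_greater_number("218765"))  # 251678
print(next_greater_number("4321"))    # -1
```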
Q10. How do you handle changing schema from the source? What are the common issues faced in Hadoop, and how did you resolve them?
Handling changing schema from source in Hadoop
Use schema evolution techniques like Avro or Parquet to handle schema changes
Implement a flexible ETL pipeline that can handle schema changes
Use tools like Apache NiFi to dynamically adjust schema during ingestion
Common issues include data loss, data corruption, and performance degradation
Resolve issues by implementing proper testing, monitoring, and backup strategies
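One concrete option, if the data lands as Parquet and is read with PySpark, is schema merging on read; the path below is a placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-evolution").getOrCreate()

# mergeSchema reconciles Parquet files written with different (compatible) schemas,
# so newly added source columns appear as nullable columns instead of breaking reads
df = spark.read.option("mergeSchema", "true").parquet("/data/events/")
df.printSchema()
```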
Q11. 1. Design and code a scheduler for allocating meeting rooms for the given input of room counts and timestamps: Input: No of rooms: 2, Time and duration: 12pm 30 min. Output: yes. Every time the code runs it shou...
Design and code a scheduler for allocating meeting rooms based on input of room counts and timestamps.
Create a table with columns for room number, start time, and end time
Use SQL queries to check for available slots and allocate rooms
Consider edge cases such as overlapping meetings and room availability
Use a loop to continuously check for available slots and allocate rooms
Implement error handling for invalid input
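A simplified in-memory sketch of the allocation check (a SQL-backed version would apply the same overlap condition in a query); the two-room setup and names are illustrative:

```python
from datetime import datetime, timedelta

def can_allocate(bookings, num_rooms, start, duration_minutes):
    """Return True and record the booking if any of the rooms is free for the slot."""
    end = start + timedelta(minutes=duration_minutes)
    for room in range(num_rooms):
        # two intervals overlap when each starts before the other ends
        overlaps = any(s < end and start < e for s, e in bookings[room])
        if not overlaps:
            bookings[room].append((start, end))
            return True
    return False

bookings = {room: [] for room in range(2)}   # 2 rooms, no bookings yet
slot = datetime(2024, 1, 1, 12, 0)           # 12pm
print(can_allocate(bookings, 2, slot, 30))   # True  -> "yes"
print(can_allocate(bookings, 2, slot, 30))   # True  -> second room
print(can_allocate(bookings, 2, slot, 30))   # False -> both rooms busy
```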
Asked several database-related questions, specifically SQL-related, such as how joins are used.
Q13. Python - sum of digits till the result becomes a single digit. Ex: 479 → 20 → 2
Python program to find the sum of digits till the result becomes a single digit.
Convert the number to a string and iterate through each digit.
Add the digits and store the result.
Repeat the process until the result becomes a single digit.
Return the single digit result.
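A minimal sketch:

```python
def digit_root(n):
    """Repeatedly sum the digits until a single digit remains, e.g. 479 -> 20 -> 2."""
    while n >= 10:
        n = sum(int(d) for d in str(n))
    return n

print(digit_root(479))  # 2
```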
Q14. Assume we have a PAN-India retail store, because of which I have two customer tables in the backend: one is a customer profile table and the other is a customer transaction table, both linked by customer ID. So what will the ...
Use a left join as a computationally efficient way to find customer names from the customer profile and transaction tables.
Use left join to combine customer profile and transaction tables based on customer id
Left join will include all customers from profile table even if they don't have transactions
Subquery may be less efficient as it has to be executed for each row in the result set
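A PySpark sketch with hypothetical stand-in tables and column names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("left-join-demo").getOrCreate()

# hypothetical stand-ins for the profile and transaction tables
profiles = spark.createDataFrame(
    [(1, "Asha"), (2, "Ravi"), (3, "Meera")], ["customer_id", "customer_name"])
txns = spark.createDataFrame(
    [(1, 500.0), (1, 120.0), (3, 80.0)], ["customer_id", "amount"])

# left join keeps every customer, even those with no transactions (amount = null)
result = profiles.join(txns, on="customer_id", how="left")
result.show()
```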
Q15. SQL question to get name of employee whose salary is greater than average salary of the department
SQL query to retrieve name of employee with salary greater than department average.
Calculate average salary of department using GROUP BY clause
Join employee and department tables using department ID
Filter employees with salary greater than department average
Select employee name
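A hedged example using Spark SQL with made-up table and column names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("salary-demo").getOrCreate()

employees = spark.createDataFrame(
    [("Asha", "Sales", 70000), ("Ravi", "Sales", 50000),
     ("Meera", "HR", 60000), ("Kiran", "HR", 40000)],
    ["emp_name", "dept_id", "salary"])
employees.createOrReplaceTempView("employees")

# compare each employee's salary against their department's average
spark.sql("""
    SELECT e.emp_name
    FROM employees e
    JOIN (SELECT dept_id, AVG(salary) AS avg_salary
          FROM employees GROUP BY dept_id) d
      ON e.dept_id = d.dept_id
    WHERE e.salary > d.avg_salary
""").show()   # Asha and Meera
```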
Q16. Sql - List employees whose age is greater than average age of all employees
List employees whose age is greater than average age of all employees using SQL.
Calculate the average age of all employees using AVG() function.
Use WHERE clause to filter out employees whose age is greater than the average age.
If age is a column of the employee table, no join is needed; otherwise join the employee table with the table that holds age.
Example: SELECT * FROM employees WHERE age > (SELECT AVG(age) FROM employees);
Q17. 1) How to handle data skewness in spark.
Data skewness in Spark can be handled by partitioning, bucketing, or using salting techniques.
Partitioning the data based on a key column can distribute the data evenly across the nodes.
Bucketing can group the data into buckets based on a key column, which can improve join performance.
Salting involves adding a random prefix to the key column, which can distribute the data evenly.
Using broadcast joins for small tables can also help in reducing skewness.
Using dynamic allocation...read more
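A hedged PySpark sketch of the salting idea, with a made-up hot key and tables:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("salting-demo").getOrCreate()
NUM_SALTS = 8  # tune to the observed skew

# hypothetical skewed fact table and small dimension table
fact = spark.range(0, 1_000_000).withColumn("key", F.lit("hot_key"))
dim = spark.createDataFrame([("hot_key", "some_value")], ["key", "value"])

# add a random salt to the skewed side, and replicate the other side over all salts
fact_salted = fact.withColumn("salt", (F.rand() * NUM_SALTS).cast("long"))
dim_salted = dim.crossJoin(spark.range(NUM_SALTS).withColumnRenamed("id", "salt"))

joined = fact_salted.join(dim_salted, on=["key", "salt"]).drop("salt")
print(joined.count())
```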
Q18. Write code to print reverse of a sentence word by word.
Code to print reverse of a sentence word by word.
Split the sentence into words using space as delimiter
Store the words in an array
Print the words in reverse order
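A minimal sketch:

```python
def reverse_words(sentence):
    """Print the words of a sentence in reverse order."""
    words = sentence.split()            # split on whitespace
    print(" ".join(reversed(words)))

reverse_words("data engineers love spark")  # spark love engineers data
```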
Q19. What optimisations are possible to reduce the overhead of reading large datasets in Spark?
Optimizations like partitioning, caching, and using efficient file formats can reduce overhead in reading large datasets in Spark.
Partitioning data based on key can reduce the amount of data shuffled during joins and aggregations
Caching frequently accessed datasets in memory can avoid recomputation
Using efficient file formats like Parquet or ORC can reduce disk I/O and improve read performance
Q20. Write Pyspark code to read csv file and show top 10 records.
Pyspark code to read csv file and show top 10 records.
Import the necessary libraries
Create a SparkSession
Read the CSV file using the SparkSession
Display the top 10 records using the show() method
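A sketch with a placeholder file path; the header and inferSchema options depend on the actual file:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-csv").getOrCreate()

df = spark.read.csv("/path/to/file.csv", header=True, inferSchema=True)
df.show(10)   # top 10 records
```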
Q21. Write a sql query to find the name of person who logged in last within each country from Person Table ?
SQL query to find the name of person who logged in last within each country from Person Table
Use a subquery to find the max login time for each country
Join the Person table with the subquery on country and login time to get the name of the person
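A hedged example using Spark SQL; table and column names are assumed since the schema isn't given:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("last-login").getOrCreate()

person = spark.createDataFrame(
    [("Asha", "IN", "2024-01-03 10:00:00"),
     ("Ravi", "IN", "2024-01-05 09:30:00"),
     ("John", "US", "2024-01-04 22:15:00")],
    ["name", "country", "login_time"])
person.createOrReplaceTempView("Person")

spark.sql("""
    SELECT p.name, p.country
    FROM Person p
    JOIN (SELECT country, MAX(login_time) AS last_login
          FROM Person GROUP BY country) m
      ON p.country = m.country AND p.login_time = m.last_login
""").show()   # Ravi (IN), John (US)
```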
Create an API so that the message is received at the other end of the provided URL.
Q23. How to import data from an RDBMS via Sqoop without a primary key
Use the --split-by option in Sqoop to import data from an RDBMS without a primary key
Use --split-by option to specify a column to split the import into multiple mappers
Use --boundary-query option to specify a query to determine the range of values for --split-by column
Example: sqoop import --connect jdbc:mysql://localhost/mydb --username root --password password --table mytable --split-by id
Example: sqoop import --connect jdbc:mysql://localhost/mydb --username root --password password ...read more
Q24. In a word count spark program which command will run on driver and which will run on executor
Commands that run on driver and executor in a word count Spark program.
The command to read the input file and create RDD will run on driver.
The command to split the lines and count the words will run on executor.
The command to aggregate the word counts and write the output will run on driver.
Driver sends tasks to executors and coordinates the overall job.
Executor processes the tasks assigned by the driver.
Q25. What are the optimization techniques applied in pyspark code?
Optimization techniques in PySpark code include partitioning, caching, and using broadcast variables.
Partitioning data based on key columns to optimize join operations
Caching frequently accessed data in memory to avoid recomputation
Using broadcast variables to efficiently share small data across nodes
Using appropriate data types and avoiding unnecessary type conversions
Avoiding shuffling of data by using appropriate transformations and actions
Using appropriate data structures...read more
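As one concrete illustration of the broadcast point, a small dimension table can be broadcast so the join avoids a shuffle (data and names below are made up):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()

orders = spark.createDataFrame(
    [(1, "IN", 250.0), (2, "US", 90.0)], ["order_id", "country", "amount"])
countries = spark.createDataFrame(
    [("IN", "India"), ("US", "United States")], ["country", "country_name"])

# broadcast the small table to every executor instead of shuffling the large one
result = orders.join(broadcast(countries), on="country")
result.show()
```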
Q26. Difference between Coalesce and Repartition and In which case we are using it ?
Coalesce is used to combine multiple small partitions into a larger one, while Repartition is used to increase or decrease the number of partitions in a DataFrame.
Coalesce reduces the number of partitions in a DataFrame by combining small partitions into larger ones.
Repartition increases or decreases the number of partitions in a DataFrame by shuffling the data across partitions.
Coalesce is more efficient than Repartition as it minimizes data movement.
Coalesce is typically us...read more
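A quick illustration of the difference:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()

df = spark.range(0, 1_000_000)
df200 = df.repartition(200)     # full shuffle: redistributes rows evenly across 200 partitions
df10 = df200.coalesce(10)       # merges existing partitions without a full shuffle

print(df200.rdd.getNumPartitions())   # 200
print(df10.rdd.getNumPartitions())    # 10
```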
Q27. Write pyspark code to change column name, divide one column by another column.
Pyspark code to change column name and divide one column by another column.
Use 'withColumnRenamed' method to change column name
Use 'withColumn' method to divide one column by another column
Example: df = df.withColumnRenamed('old_col_name', 'new_col_name')
Example: df = df.withColumn('ratio', df['col1'] / df['col2'])
Q28. 1. Difference between shallow copy and deep copy 2. How can you merge two dataframes with different column names 3. Regex question to find all the characters and numbers of particular length size 4. Spark funda...
1. Shallow copy creates a new object but does not duplicate nested objects; deep copy creates a new object and duplicates all nested objects.
2. Merging dataframes with different column names requires renaming columns.
3. Regex can be used to find characters and numbers of a specific length.
4. Spark fundamentals involve understanding distributed computing and data processing.
Shallow copy: new object with same references to nested objects. Deep copy: new object with duplicate...read more
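A short illustration of point 1 using Python's copy module:

```python
import copy

original = {"ids": [1, 2, 3]}

shallow = copy.copy(original)       # new dict, but the inner list is shared
deep = copy.deepcopy(original)      # new dict and a new inner list

original["ids"].append(4)
print(shallow["ids"])               # [1, 2, 3, 4]  - shared nested object changed
print(deep["ids"])                  # [1, 2, 3]     - deep copy is unaffected
```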
Q29. Difference between Rank , Dense Rank and Row Number and when we are using each of them ?
RANK assigns the same rank to tied rows and leaves gaps after ties, DENSE_RANK assigns the same rank to tied rows without gaps, and ROW_NUMBER assigns a unique sequential number to every row.
Rank assigns the same rank to rows with the same value, leaving gaps in the ranking if there are ties.
Dense Rank assigns a unique rank to each distinct row, leaving no gaps in the ranking.
Row Number assigns a unique number to each row, without any regard for the values in the rows.
Rank is used when you want to see the ranking of eac...read more
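A PySpark window example with made-up data showing the three functions side by side:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ranking-demo").getOrCreate()

scores = spark.createDataFrame(
    [("Asha", 90), ("Ravi", 90), ("Meera", 80), ("Kiran", 70)], ["name", "score"])

w = Window.orderBy(F.desc("score"))
scores.select(
    "name", "score",
    F.rank().over(w).alias("rank"),              # 1, 1, 3, 4  (gap after the tie)
    F.dense_rank().over(w).alias("dense_rank"),  # 1, 1, 2, 3  (no gap)
    F.row_number().over(w).alias("row_number"),  # 1, 2, 3, 4  (always unique)
).show()
```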
Q30. What happens when we enforce the schema and when we manually define the schema in the code ?
Enforcing the schema ensures data consistency and validation, while manually defining the schema in code allows for more flexibility and customization.
Enforcing the schema ensures that all data conforms to a predefined structure and format, preventing errors and inconsistencies.
Manually defining the schema in code allows for more flexibility in handling different data types and structures.
Enforcing the schema can be done using tools like Apache Avro or Apache Parquet, while m...read more
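A sketch of manually defining a schema in PySpark; the path and column names are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("schema-demo").getOrCreate()

# explicit schema: no inference pass over the data, and malformed values surface early
schema = StructType([
    StructField("customer_id", IntegerType(), nullable=True),
    StructField("customer_name", StringType(), nullable=True),
])

df = spark.read.csv("/path/to/customers.csv", header=True, schema=schema)
df.printSchema()
```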
Q31. Difference between cache and persist, repartition and coalesce.
Cache and persist are used to store data in memory. Repartition and coalesce are used to change the number of partitions.
Cache stores the data in memory for faster access while persist allows the user to choose the storage level.
Repartition increases the number of partitions while coalesce decreases the number of partitions.
Cache and persist are lazy markers that take effect when an action runs; repartition and coalesce are transformations, not actions.
Cache and persist are used for iterative algorithms while repartition and...read more
Q32. 5) How to create a kafka topic with replication factor 2
To create a Kafka topic with replication factor 2, use the command line tool or Kafka API.
Use the command line tool 'kafka-topics.sh' with the '--replication-factor' flag set to 2.
Alternatively, use the Kafka API to create a topic with a replication factor of 2.
Ensure that the number of brokers in the Kafka cluster is greater than or equal to the replication factor.
Consider setting the 'min.insync.replicas' configuration property to 2 to ensure that at least two replicas are ...read more
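A hedged sketch using the kafka-python admin client (assuming that client is available; the CLI equivalent is kafka-topics.sh with --replication-factor 2):

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# replication factor 2 requires at least 2 brokers in the cluster
topic = NewTopic(name="my-topic", num_partitions=3, replication_factor=2)
admin.create_topics(new_topics=[topic])
```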
Q33. If you want very low latency - which is better, standalone or client mode?
Client mode is better for very low latency due to direct communication with the cluster.
Client mode allows direct communication with the cluster, reducing latency.
Standalone mode requires an additional layer of communication, increasing latency.
Client mode is preferred for real-time applications where low latency is crucial.
Q34. How to add a column in dataframe ? How to rename the column in dataframe ?
To add a column in a dataframe, use the 'withColumn' method. To rename a column, use the 'withColumnRenamed' method.
To add a column, use the 'withColumn' method with the new column name and the expression to compute the values for that column.
Example: df.withColumn('new_column', df['existing_column'] * 2)
To rename a column, use the 'withColumnRenamed' method with the current column name and the new column name.
Example: df.withColumnRenamed('old_column', 'new_column')
Q35. When a spark job is submitted, what happens at backend. Explain the flow.
When a spark job is submitted, various steps are executed at the backend to process the job.
The job is submitted to the Spark driver program.
The driver program communicates with the cluster manager to request resources.
The cluster manager allocates resources (CPU, memory) to the job.
The driver program creates DAG (Directed Acyclic Graph) of the job stages and tasks.
Tasks are then scheduled and executed on worker nodes in the cluster.
Intermediate results are stored in memory o...read more
Q36. How many stages will create from the above code that I have written
The number of stages created from the code provided depends on the specific code and its functionality.
The number of stages can vary based on the complexity of the code and the specific tasks being performed.
Stages may include data extraction, transformation, loading, and processing.
It is important to analyze the code and identify distinct stages to determine the total number.
Q37. 1. What is udf in Spark? 2. Write PySpark code to check the validity of mobile_number column
UDF stands for User-Defined Function in Spark. It allows users to define their own functions to process data.
UDFs can be written in different programming languages like Python, Scala, and Java.
UDFs can be used to perform complex operations on data that are not available in built-in functions.
PySpark code to check the validity of mobile_number column can be written using regular expressions and the `regexp_extract` function.
Example: `df.select('mobile_number', regexp_extract('...read more
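A runnable sketch of the validity check using rlike rather than a UDF; the "10 digits starting with 6-9" rule is an assumption about what counts as valid:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("mobile-validation").getOrCreate()

df = spark.createDataFrame(
    [("9876543210",), ("12345",), ("98765abc10",)], ["mobile_number"])

# assumption: a valid number is exactly 10 digits and starts with 6-9
df = df.withColumn("is_valid", F.col("mobile_number").rlike(r"^[6-9][0-9]{9}$"))
df.show()
```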
Q38. Elaboration of Spark optimization techniques. Types of transformations, shuffling.
Spark optimization techniques include partitioning, caching, and using appropriate transformations.
Partitioning data can improve performance by reducing shuffling.
Caching frequently used data can reduce the need for recomputation.
Transformations like filter, map, and reduceByKey can be used to optimize data processing.
Shuffling can be minimized by using operations like reduceByKey instead of groupByKey.
Broadcasting small data can improve performance by reducing network traffi...read more
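A small word-count illustration of preferring reduceByKey over groupByKey:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-demo").getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["spark", "hive", "spark", "kafka", "spark"])
pairs = words.map(lambda w: (w, 1))

# reduceByKey combines values on each partition before shuffling,
# so far less data crosses the network than with groupByKey
counts = pairs.reduceByKey(lambda a, b: a + b)
print(counts.collect())   # [('spark', 3), ('hive', 1), ('kafka', 1)] (order may vary)
```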
Q39. What are internal and external tables in Hive?
Internal tables store data within Hive's warehouse directory while external tables store data outside of it.
Internal tables are managed by Hive and are deleted when the table is dropped
External tables are not managed by Hive and data is not deleted when the table is dropped
Internal tables are faster for querying as data is stored within Hive's warehouse directory
External tables are useful for sharing data between different systems
Example: CREATE TABLE my_table (col1 INT, col2...read more
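A hedged illustration via Spark SQL (assumes a configured Hive metastore; table names and location are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-tables").enableHiveSupport().getOrCreate()

# managed (internal) table: Hive owns both the metadata and the data files
spark.sql("CREATE TABLE IF NOT EXISTS my_table (col1 INT, col2 STRING)")

# external table: dropping it removes only the metadata, the files stay in place
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS my_ext_table (col1 INT, col2 STRING)
    LOCATION '/data/my_ext_table/'
""")
```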
Q40. How do you do performance optimization in Spark. Tell how you did it in you project.
Performance optimization in Spark involves tuning configurations, optimizing code, and utilizing caching.
Tune Spark configurations such as executor memory, number of executors, and shuffle partitions.
Optimize code by reducing unnecessary shuffles, using efficient transformations, and avoiding unnecessary data movements.
Utilize caching to store intermediate results in memory and avoid recomputation.
Example: In my project, I optimized Spark performance by increasing executor me...read more
Q41. What is the difference between DBMS and RDBMS?
DBMS is a software system to manage databases while RDBMS is a type of DBMS that stores data in a structured manner.
DBMS stands for Database Management System while RDBMS stands for Relational Database Management System.
DBMS can manage any type of database while RDBMS manages only relational databases.
DBMS does not enforce any specific data model while RDBMS enforces the relational data model.
Examples of DBMS include MongoDB and Cassandra while examples of RDBMS include MySQL...read more
Q42. What is data flow? Difference with ADF pipeline and data flow
Data flow is a visual representation of data movement and transformation. ADF pipeline is a set of activities to move and transform data.
Data flow is a drag-and-drop interface to design data transformation logic
ADF pipeline is a set of activities to orchestrate data movement and transformation
Data flow is more flexible and powerful than ADF pipeline
Data flow can be used to transform data within a pipeline or as a standalone entity
Q43. What is Data Lake? Difference between data lake and data warehouse
Data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed.
Data lake stores raw, unstructured data from various sources.
Data lake allows for storing large amounts of data without the need for a predefined schema.
Data lake is cost-effective for storing data that may not have a clear use case at the time of storage.
Data warehouse stores structured data for querying and analysis.
Data warehouse requires a predefined schema for d...read more
Q44. What will happen if a job has failed in the pipeline and the data processing cycle is over?
If a job fails in the pipeline and data processing cycle is over, it can lead to incomplete or inaccurate data.
Incomplete data may affect downstream processes and analysis
Data quality may be compromised if errors are not addressed
Monitoring and alerting systems should be in place to detect and handle failures
Re-running the failed job or implementing error handling mechanisms can help prevent issues in the future
Q45. Introduction. Project flow. Why did you use HBase in your project? How did you query for data in HBase? What was the purpose of Hive? What are external partitioned tables? Optimization done in your projects.
Discussion on project flow, HBase, Hive, external partitioned tables, and optimization in a Data Engineer interview.
Explained project flow and the reason for using HBase in the project
Discussed querying data in HBase and the purpose of Hive
Described external partitioned tables and optimization techniques used in the project
Q46. Merge two unsorted lists such that the output list is sorted. You are free to use inbuilt sorting functions to sort the input lists
Merge two unsorted lists into a sorted list using inbuilt sorting functions.
Use inbuilt sorting functions to sort the input lists
Merge the sorted lists using a merge algorithm
Return the merged and sorted list
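A minimal sketch following the steps above:

```python
import heapq

def merge_unsorted(list1, list2):
    """Sort each input list, then merge the two sorted lists into one."""
    a, b = sorted(list1), sorted(list2)   # inbuilt sort is allowed per the problem
    return list(heapq.merge(a, b))        # linear, two-pointer style merge

print(merge_unsorted([5, 1, 4], [3, 2]))  # [1, 2, 3, 4, 5]
```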
Q47. Write a python program to convert a number to words. For ex: i/p 123, o/p - One hundred twenty three
Python program to convert a number to words.
Use a dictionary to map numbers to words.
Divide the number into groups of three digits and convert each group to words.
Handle special cases like zero, negative numbers, and numbers greater than or equal to one billion.
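A sketch limited to 0-999 to keep it short (matching the 123 example); extending it to thousands, millions, and billions follows the same three-digit grouping idea:

```python
ONES = ["", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
        "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
        "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]

def number_to_words(n):
    """Convert 0-999 to words, e.g. 123 -> 'One hundred twenty three'."""
    if n == 0:
        return "Zero"
    words = []
    if n >= 100:
        words.append(ONES[n // 100] + " hundred")
        n %= 100
    if n >= 20:
        words.append(TENS[n // 10])
        n %= 10
    if n > 0:
        words.append(ONES[n])
    return " ".join(words).capitalize()

print(number_to_words(123))  # One hundred twenty three
```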
Q48. Give an example of how to treat different categorised values based on their frequency.
Treating categorised values based on frequency involves grouping rare values together.
Identify rare values based on their frequency distribution
Group rare values together to reduce complexity
Consider creating a separate category for rare values
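A small pandas illustration; the "fewer than 2 occurrences" threshold is an arbitrary assumption:

```python
import pandas as pd

s = pd.Series(["visa", "visa", "mastercard", "visa", "rupay", "amex", "mastercard"])

# categories seen fewer than 2 times are treated as rare and grouped together
freq = s.value_counts()
rare = freq[freq < 2].index
s_grouped = s.where(~s.isin(rare), other="other")

print(s_grouped.value_counts())   # rupay and amex collapse into 'other'
```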
Q49. Explanation of current project architecture, cloud services used in the project and the purpose of using them. Architecture of Spark, Hive.
Our project architecture uses Spark and Hive for data processing and storage respectively. We utilize AWS services such as S3, EMR, and Glue for scalability and cost-effectiveness.
Spark is used for distributed data processing and analysis
Hive is used for data warehousing and querying
AWS S3 is used for storing large amounts of data
AWS EMR is used for running Spark and Hive clusters
AWS Glue is used for ETL (Extract, Transform, Load) jobs
The purpose of using these services is to...read more
Q50. Why do we need a data warehouse, why can't we store in the normal transactional database.
Data warehouses are designed for analytical queries and reporting, while transactional databases are optimized for transactional processing.
Data warehouses are optimized for read-heavy workloads, allowing for complex queries and reporting.
Transactional databases are optimized for write-heavy workloads, ensuring data integrity and consistency.
Data warehouses often store historical data for analysis, while transactional databases focus on current data for operational purposes.
D...read more