Senior Data Engineer
200+ Senior Data Engineer Interview Questions and Answers
Q51. What is your understanding of dbt (data build tool) and its applications in data transformation?
dbt is a data build tool used for transforming data in the data warehouse by writing SQL queries and managing dependencies.
dbt is an open-source tool that allows data engineers to transform data in the data warehouse using SQL queries.
It helps in managing dependencies between different SQL queries and ensures that the data transformation process is efficient and reliable.
dbt can be used to create reusable SQL models, run tests on data quality, and document the data transformation process.
Q52. What are some of the analytical functions available in SQL?
Analytical functions in SQL are used to perform calculations on sets of rows.
Aggregate functions like SUM, AVG, COUNT, MIN, MAX (analytical when paired with an OVER clause)
Window functions like ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD
Ranking functions like NTILE, PERCENT_RANK, CUME_DIST
Statistical functions like STDDEV, VARIANCE
String functions like CONCAT, SUBSTRING, TRIM (scalar functions, often used alongside the analytical ones)
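A minimal PySpark sketch of analytical (window) functions, assuming a hypothetical sales table with region and amount columns:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("analytic-functions").getOrCreate()

# Hypothetical sample data: (region, amount)
spark.createDataFrame(
    [("EU", 100), ("EU", 250), ("US", 300), ("US", 50)],
    ["region", "amount"],
).createOrReplaceTempView("sales")

# RANK, LAG and a windowed SUM computed per region
spark.sql("""
    SELECT region,
           amount,
           RANK()      OVER (PARTITION BY region ORDER BY amount DESC) AS rnk,
           LAG(amount) OVER (PARTITION BY region ORDER BY amount DESC) AS prev_amount,
           SUM(amount) OVER (PARTITION BY region)                      AS region_total
    FROM sales
""").show()
```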
Q53. What is the difference between a dynamic data frame and a Spark data frame in an AWS Glue job? Can we convert a dynamic data frame to a Spark data frame?
Dynamic data frame in AWS Glue job is a dynamically generated data frame, while Spark data frame is specifically created using Spark APIs.
Dynamic data frame is generated dynamically at runtime based on the data source and schema, while Spark data frame is explicitly created using Spark APIs.
Dynamic data frame is more flexible but may have performance implications compared to Spark data frame.
You can convert a dynamic data frame to a Spark data frame by explicitly calling the toDF() method, as sketched below.
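A minimal sketch of the conversion, assuming the AWS Glue job runtime (the awsglue library is only available there) and hypothetical catalog names my_db / my_table:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)

# Read a DynamicFrame from the Glue Data Catalog (names are hypothetical)
dyf = glue_context.create_dynamic_frame.from_catalog(database="my_db", table_name="my_table")

# Convert DynamicFrame -> Spark DataFrame to use the full DataFrame API
df = dyf.toDF()
df.printSchema()

# Convert back to a DynamicFrame when Glue-specific writers/transforms are needed
dyf_back = DynamicFrame.fromDF(df, glue_context, "dyf_back")
```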
Q54. How do you write a file to a Delta table?
To write a file in a delta table, you can use the Delta Lake API or Spark SQL commands.
Use Delta Lake API to write data to a delta table
Use Spark SQL commands like INSERT INTO to write data to a delta table
Ensure that the data being written is in the correct format and schema
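A short sketch of writing a DataFrame to a Delta table, assuming a Spark session with the Delta Lake package available (e.g. Databricks) and a hypothetical table location:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-write").getOrCreate()

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# Append the DataFrame to a Delta table at a hypothetical path
df.write.format("delta").mode("append").save("/tmp/delta/customers")

# Equivalent SQL route: register the data and INSERT INTO an existing Delta table
df.createOrReplaceTempView("staging_customers")
spark.sql("INSERT INTO customers SELECT * FROM staging_customers")
```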
Q55. Do you have experience in Dataflow, Dataproc, cloud composer?
Yes, I have experience in Dataflow, Dataproc, and cloud composer.
I have worked with Dataflow to process and analyze large datasets in real-time.
I have used Dataproc to create and manage Apache Spark and Hadoop clusters for big data processing.
I have experience with cloud composer for orchestrating workflows and managing data pipelines.
Q56. How can you optimize your queries for efficiency in BQ?
Optimizing queries in BigQuery involves using partitioned tables, clustering, and optimizing joins.
Partition tables by date or another relevant column to reduce the amount of data scanned
Use clustering to group related rows together, reducing the amount of data scanned for queries
Avoid unnecessary joins and denormalize data where possible to reduce query complexity
Q57. How do you utilize the enhanced optimization option in AWS Glue?
Enhanced optimization in AWS Glue improves job performance by automatically adjusting resources based on workload
Enhanced optimization in AWS Glue automatically adjusts resources like DPUs based on workload
It helps improve job performance by optimizing resource allocation
Users can enable enhanced optimization in AWS Glue job settings
Q58. What are the best practices for optimizing querying in Amazon Redshift?
Optimizing querying in Amazon Redshift involves proper table design, distribution keys, sort keys, and query optimization techniques.
Use appropriate distribution keys to evenly distribute data across nodes for parallel processing.
Utilize sort keys to physically order data on disk, reducing the need for sorting during queries.
Avoid using SELECT * and instead specify only the columns needed to reduce data transfer.
Use the ANALYZE command to update table statistics so the query planner can choose efficient execution plans.
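As an illustration, a hypothetical Redshift table DDL declaring the distribution style, distribution key, and sort key (held in a Python string; table and column names are invented):

```python
# Redshift DDL declaring distribution style, distribution key and sort key
# (table and column names are hypothetical)
redshift_ddl = """
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    sale_date   DATE,
    amount      DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (sale_date);
"""
```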
Q59. After performing joins, how many records would be retrieved for inner, left, right, and outer joins?
The number of records retrieved after performing joins depends on the type of join - inner, left, right, or outer.
Inner join retrieves only the matching records from both tables
Left join retrieves all records from the left table and matching records from the right table
Right join retrieves all records from the right table and matching records from the left table
Outer join retrieves all records from both tables, filling in NULL values for non-matching records
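A tiny PySpark illustration of the resulting row counts, using two hypothetical single-column tables (ids 1-3 on the left, ids 2-4 on the right):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-counts").getOrCreate()

left = spark.createDataFrame([(1,), (2,), (3,)], ["id"])
right = spark.createDataFrame([(2,), (3,), (4,)], ["id"])

print(left.join(right, "id", "inner").count())  # 2 -> only ids present on both sides
print(left.join(right, "id", "left").count())   # 3 -> every left row, NULLs for id 1
print(left.join(right, "id", "right").count())  # 3 -> every right row, NULLs for id 4
print(left.join(right, "id", "outer").count())  # 4 -> all ids from both sides
```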
Q60. How are primary keys and foreign keys implemented in Delta tables?
Primary keys and foreign keys can be declared on Delta tables as constraints, though in Databricks they are informational rather than enforced.
Primary keys are declared with the PRIMARY KEY constraint, which identifies the unique identifier column(s) of the table.
Foreign keys are declared with the FOREIGN KEY constraint, which establishes a link between two tables based on a common column.
The referenced table must have a primary key defined, and the foreign key column in the referencing table must hold values that match it.
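A sketch of the declarations, assuming a Databricks environment with Unity Catalog (where PRIMARY KEY / FOREIGN KEY on Delta tables are informational, not enforced) and hypothetical table names; spark is the session provided by the notebook:

```python
# Dimension table with an informational primary key (the key column must be NOT NULL)
spark.sql("""
    CREATE TABLE IF NOT EXISTS dim_customer (
        customer_id BIGINT NOT NULL,
        name        STRING,
        CONSTRAINT pk_customer PRIMARY KEY (customer_id)
    ) USING DELTA
""")

# Fact table referencing the dimension through an informational foreign key
spark.sql("""
    CREATE TABLE IF NOT EXISTS fact_orders (
        order_id    BIGINT NOT NULL,
        customer_id BIGINT,
        CONSTRAINT pk_order    PRIMARY KEY (order_id),
        CONSTRAINT fk_customer FOREIGN KEY (customer_id) REFERENCES dim_customer (customer_id)
    ) USING DELTA
""")
```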
Q61. Write a SQL query to select data from table 2 where data exists in table 1
Use a SQL query to select data from table 2 where data exists in table 1
Use a JOIN statement to link the two tables based on a common column
Specify the columns you want to select from table 2
Use a WHERE clause to check for existence of data in table 1
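A compact sketch using WHERE EXISTS, with hypothetical tables registered as temp views and id as the common column:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("exists-filter").getOrCreate()

# table1 holds the reference ids, table2 holds the detail rows (hypothetical data)
spark.createDataFrame([(1,), (2,)], ["id"]).createOrReplaceTempView("table1")
spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "val"]).createOrReplaceTempView("table2")

# Keep only table2 rows whose id also exists in table1
spark.sql("""
    SELECT t2.*
    FROM table2 t2
    WHERE EXISTS (SELECT 1 FROM table1 t1 WHERE t1.id = t2.id)
""").show()
```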
Q62. How do you handle missing data in a PySpark DataFrame?
Handle missing data in pyspark dataframe by using functions like dropna, fillna, or replace.
Use dropna() function to remove rows with missing data
Use fillna() function to fill missing values with a specified value
Use the replace() function to substitute placeholder values (such as 'N/A' strings) that represent missing data
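A small sketch of the three approaches on a hypothetical DataFrame:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("missing-data").getOrCreate()

df = spark.createDataFrame(
    [(1, "alice", None), (2, None, 30), (3, "N/A", 25)],
    "id INT, name STRING, age INT",
)

df.dropna().show()                                  # drop rows containing any null
df.dropna(subset=["age"]).show()                    # drop rows where specific columns are null
df.fillna({"name": "unknown", "age": 0}).show()     # fill nulls per column
df.na.replace("N/A", None, subset=["name"]).show()  # turn placeholder strings into real nulls
```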
Q63. How would you ensure that your ADF pipeline does not fail?
To ensure ADF pipeline does not fail, monitor pipeline health, handle errors gracefully, optimize performance, and conduct regular testing.
Monitor pipeline health regularly to identify and address potential issues proactively
Handle errors gracefully by implementing error handling mechanisms such as retries, logging, and notifications
Optimize performance by tuning pipeline configurations, optimizing data processing logic, and utilizing appropriate resources
Conduct regular testing of pipelines to catch issues before they reach production.
Q64. In Databricks, when a Spark job is submitted, what happens at the backend? Explain the flow.
When a Spark job is submitted in Databricks, several backend processes are triggered to execute it.
The submitted spark job is divided into tasks by the Spark driver.
The tasks are then scheduled to run on the available worker nodes in the cluster.
The worker nodes execute the tasks and return the results to the driver.
The driver aggregates the results and presents them to the user.
Various optimizations such as data shuffling and caching may be applied during the execution process.
Q65. What are vertices and edges in a DAG?
Vertices are nodes and edges are connections between nodes in a directed acyclic graph (DAG).
Vertices represent the tasks or operations in a DAG.
Edges represent the dependencies between tasks or operations.
Vertices can have multiple incoming edges and outgoing edges.
Edges can be weighted to represent the cost or time required to complete a task.
Examples of DAGs include data processing pipelines and task scheduling systems.
Q66. What is the process for finding the missing number from a list?
To find the missing number from a list, calculate the sum of all numbers in the list and subtract it from the expected sum of the list.
Calculate the sum of all numbers in the list using a loop or a built-in function.
Calculate the expected sum using the formula n*(n+1)/2, where n is the largest expected value (for a list meant to hold 1..n with one value missing, n is the length of the list plus one).
Subtract the sum of the list from the expected sum to find the missing number.
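A minimal Python sketch for a list that should contain 1..n with exactly one value missing:

```python
def find_missing_number(nums):
    """Return the missing value from a list expected to contain 1..n exactly once."""
    n = len(nums) + 1                # the complete sequence has one more element
    expected_sum = n * (n + 1) // 2  # sum of 1..n
    return expected_sum - sum(nums)

print(find_missing_number([1, 2, 4, 5, 6]))  # -> 3
```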
Q67. What is the difference between repartition and coalesce, and between clustering and partitioning in BigQuery?
Repartition increases or decreases the number of partitions in a DataFrame, while coalesce only decreases the number of partitions. Clustering is used for organizing data within a partition, while partitioning is dividing data into logical units.
Repartition increases or decreases the number of partitions in a DataFrame, which can be useful for parallelism and performance optimization.
Coalesce only decreases the number of partitions in a DataFrame, which is more efficient than repartition because it avoids a full shuffle; see the sketch below.
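A short sketch contrasting the two on the Spark side, plus the BigQuery DDL form of partitioning and clustering (dataset and table names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitions").getOrCreate()

df = spark.range(1_000_000)

# repartition triggers a full shuffle and can increase or decrease the partition count
df_more = df.repartition(200)

# coalesce merges existing partitions without a full shuffle, so it can only decrease it
df_fewer = df_more.coalesce(10)
print(df_more.rdd.getNumPartitions(), df_fewer.rdd.getNumPartitions())  # 200 10

# On the BigQuery side, partitioning and clustering are declared in the table DDL:
bq_ddl = """
CREATE TABLE my_dataset.events
PARTITION BY DATE(event_ts)
CLUSTER BY user_id AS
SELECT * FROM my_dataset.raw_events
"""
```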
Q68. Daily activities in personal and professional life, work-life balance, and the Indian IT industry
Maintaining work-life balance is crucial for personal and professional growth in Indian IT industries.
Prioritizing tasks and time management is important for a balanced routine.
Taking breaks and engaging in physical activities can help reduce stress and improve productivity.
Setting boundaries and communicating with colleagues and family members can help maintain a healthy work-life balance.
Indian IT industries are known for their demanding work culture, but companies are increasingly promoting flexible working and employee well-being.
Q69. How would you delete duplicate records from a table?
To delete duplicate records from a table, you can use the DELETE statement with a self-join or subquery.
Identify the duplicate records using a self-join or subquery
Use the DELETE statement to remove the duplicate records
Consider using a temporary table to store the unique records before deleting the duplicates
Q70. Given a list of words, write a Python program to print the most repeated substring across all the words.
Python program to find the most repeating substring in a list of words.
Iterate through each word in the list
Generate all possible substrings for each word
Count the occurrences of each substring using a dictionary
Find the substring with the highest count
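A brute-force Python sketch (fine for short word lists); min_len is an assumed parameter used to skip single characters:

```python
from collections import Counter

def most_repeating_substring(words, min_len=2):
    """Return the substring of at least min_len characters that occurs most often."""
    counts = Counter()
    for word in words:
        for i in range(len(word)):
            for j in range(i + min_len, len(word) + 1):
                counts[word[i:j]] += 1
    return counts.most_common(1)[0][0] if counts else ""

print(most_repeating_substring(["banana", "bandana", "cabana"]))  # -> "an"
```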
Q71. What is the difference between Cache() and Persist()?
Cache() and Persist() are both used for caching RDDs in Apache Spark, but Persist() allows for more customization.
Cache() is shorthand for Persist() with the default storage level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames)
Persist() allows for specifying different storage levels like MEMORY_ONLY, MEMORY_AND_DISK, etc.
Persist() also supports serialized storage levels such as MEMORY_ONLY_SER; the serializer itself (Java or Kryo) is configured via spark.serializer
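A brief sketch of both calls:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-vs-persist").getOrCreate()

df1 = spark.range(10_000_000)
df2 = spark.range(10_000_000)

df1.cache()                          # shorthand for the default storage level
df2.persist(StorageLevel.DISK_ONLY)  # persist() lets you choose the level explicitly

df1.count()                          # caching is lazy; an action materializes it
df2.count()

df1.unpersist()
df2.unpersist()
```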
Q72. How does a DAG handle fault tolerance?
DAGs handle fault tolerance by rerunning failed tasks and maintaining task dependencies.
DAGs rerun failed tasks automatically to ensure completion.
DAGs maintain task dependencies to ensure proper sequencing.
DAGs can be configured to retry failed tasks a certain number of times before marking them as failed.
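If the DAG in question is an orchestrator DAG (for example Apache Airflow), retries and dependencies look roughly like this sketch; the operator names, callables, and schedule are illustrative:

```python
from datetime import timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago

# Retries and retry delays give each task fault tolerance; values here are illustrative
default_args = {"retries": 3, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="example_fault_tolerant_dag",
    default_args=default_args,
    schedule_interval="@daily",
    start_date=days_ago(1),
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=lambda: print("extract"))
    load = PythonOperator(task_id="load", python_callable=lambda: print("load"))
    extract >> load  # the dependency enforces proper sequencing
```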
Q73. What is Databricks, and how does it differ from Azure Data Factory?
Databricks is a unified analytics platform for big data and machine learning, while Azure Data Factory is a cloud-based data integration service.
Databricks is an integrated workspace for data engineering, data science, and machine learning tasks.
Azure Data Factory is a cloud-based data integration service that allows you to create, schedule, and manage data pipelines.
Databricks provides collaborative notebooks for data exploration and visualization, while Azure Data Factory focuses on orchestrating data movement and transformation activities.
Q74. Project experience and end-to-end data migration
I have extensive experience in leading data migration projects from start to finish, ensuring seamless transition and minimal disruption.
Led a team in migrating legacy data from on-premise servers to cloud storage
Developed data mapping strategies to ensure accurate transfer of information
Implemented data validation processes to identify and rectify any discrepancies
Collaborated with stakeholders to define project scope and timelines
Utilized ETL tools such as Informatica and T...read more
Q75. When do we use SSIS packages? What is the difference between Union and Merge?
SSIS packages are used for ETL processes in SQL Server. Union All stacks datasets vertically, Merge combines two sorted inputs into one sorted output, and Merge Join matches rows horizontally.
SSIS packages are used for Extract, Transform, Load (ETL) processes in SQL Server.
Union All in SSIS combines datasets vertically, stacking rows on top of each other without removing duplicates.
Merge in SSIS also combines datasets vertically, but requires both inputs to be sorted and produces a single sorted output.
Merge Join in SSIS combines datasets horizontally, matching rows on specified join columns much like a SQL join.
Q76. How do we create a duplicate table? What are window functions? What are the types of joins? Explain each join.
To duplicate a table, use CREATE TABLE AS or INSERT INTO SELECT. Window functions are used for calculations across a set of table rows. Types of joins include INNER, LEFT, RIGHT, and FULL OUTER joins.
To duplicate a table, use CREATE TABLE AS or INSERT INTO SELECT
Window functions are used for calculations across a set of table rows
Types of joins include INNER, LEFT, RIGHT, and FULL OUTER joins
Explain each join: INNER returns rows with at least one match in both tables; LEFT returns all rows from the left table plus matches from the right; RIGHT returns all rows from the right table plus matches from the left; FULL OUTER returns all rows from both tables, with NULLs where there is no match.
Q77. SQL joins, number of records for each join
SQL joins combine data from two or more tables based on a related column between them.
Different types of SQL joins include INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL JOIN.
The number of records in the result of a join depends on the type of join and the data in the tables.
For example, an INNER JOIN will only return records that have matching values in both tables, while a LEFT JOIN will return all records from the left table and matching records from the right table.
Q78. SQL query to delete Duplicates using CTEs
Use a Common Table Expression (CTE) to delete duplicates in a SQL query.
Use a CTE to rank rows with ROW_NUMBER(), partitioned by the columns that define a duplicate
Delete the rows whose row number is greater than 1 in the outer statement
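A sketch of the pattern with a hypothetical orders table; the DELETE-from-CTE form works on engines such as SQL Server, and a PySpark fallback is noted for engines that lack it:

```python
# SQL Server-style: rank duplicates inside a CTE, then delete everything ranked > 1
delete_duplicates_sql = """
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY customer_id, order_date   -- columns that define a duplicate
               ORDER BY order_id
           ) AS rn
    FROM orders
)
DELETE FROM ranked
WHERE rn > 1;
"""

# PySpark fallback when the engine cannot delete through a CTE:
# deduped = df.dropDuplicates(["customer_id", "order_date"])
```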
Q79. What is the difference between BigQuery and Bigtable? What is partitioning in BigQuery?
BigQuery is a fully managed, serverless data warehouse while BigTable is a NoSQL database for real-time analytics. Partitioning in BQ helps in organizing data for efficient querying.
BigQuery is a data warehouse used for analyzing large datasets using SQL queries.
BigTable is a NoSQL database used for real-time analytics and high-throughput applications.
Partitioning in BigQuery involves dividing large tables into smaller, manageable parts based on a specified column or field.
Partitioning reduces the amount of data scanned per query, improving performance and lowering cost.
Q80. How do you read a text file as one column in PySpark and get the word counts?
Read a text file as one column in pyspark and get word counts
Use spark.read.text() to load the file into a DataFrame with a single column, one row per line
Split the text column into words using split() function
Use groupBy() and count() functions to get the word counts
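A compact sketch, assuming a hypothetical file path:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, lower, split

spark = SparkSession.builder.appName("word-count").getOrCreate()

# spark.read.text returns a single-column DataFrame ("value"), one row per line
lines = spark.read.text("/tmp/sample.txt")  # hypothetical path

word_counts = (
    lines.select(explode(split(lower(col("value")), r"\s+")).alias("word"))
         .filter(col("word") != "")
         .groupBy("word")
         .count()
         .orderBy(col("count").desc())
)
word_counts.show()
```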
Q81. How do you migrate tables from Snowflake to BigQuery?
Tables can be migrated from Snowflake to BigQuery using tools like Data Transfer Service or manually exporting/importing data.
Use Google Cloud Data Transfer Service to automate the migration process
Export data from Snowflake in a compatible format like CSV or JSON
Import the exported data into BigQuery using tools like Cloud Storage or Dataflow
Q82. How did you ingest Kafka data, and what were your day-to-day activities?
I ingested Kafka data using Kafka Connect and performed data processing and analysis on a daily basis.
Used Kafka Connect to ingest data from various sources into Kafka topics
Developed custom Kafka Connect connectors for specific data sources
Utilized Kafka Streams for real-time data processing and analysis
Worked with schema registry for data serialization and deserialization
Q83. How do you identify the root cause when one executor takes much longer than the others?
Identifying root cause of slow executor compared to others
Check resource utilization of the slow executor (CPU, memory, disk)
Look for any specific tasks or stages that are taking longer on the slow executor
Check for network latency or communication issues affecting the slow executor
Monitor garbage collection and JVM metrics for potential bottlenecks
Consider data skew or unbalanced data distribution causing slow performance
Q84. Dataflow vs Dataproc, layering processing and curated environments in GCP, data cleaning
Dataflow and Dataproc are both processing services in GCP, but with different approaches and use cases.
Dataflow is a fully managed service for executing batch and streaming data processing pipelines.
Dataproc is a managed Spark and Hadoop service for running big data processing and analytics workloads.
Dataflow provides a serverless and auto-scaling environment, while Dataproc offers more control and flexibility.
Dataflow is suitable for real-time streaming and complex data transformations, while Dataproc suits existing Spark and Hadoop workloads.
Q85. How do you handle stakeholders' interests?
Handle stakeholder's interests by understanding their needs, communicating effectively, and delivering results.
Understand the stakeholders' needs and priorities
Communicate regularly and effectively with stakeholders
Involve stakeholders in decision-making processes
Deliver results that align with stakeholders' interests
Manage expectations and address concerns promptly
Q86. What are the Types of SCD?
Types of SCD include Type 1, Type 2, and Type 3.
Type 1 SCD: Overwrites old data with new data, no history is maintained.
Type 2 SCD: Maintains historical data by creating new records for changes.
Type 3 SCD: Creates separate columns to store historical and current data.
Examples: Type 1 - Employee address updates overwrite old address. Type 2 - Employee salary changes create new record with effective date. Type 3 - Employee job title history stored in separate columns.
Q87. How do you perform performance optimization in Spark?
Performance optimization in Spark involves tuning configurations, optimizing code, and utilizing caching.
Tune Spark configurations such as executor memory, cores, and parallelism
Optimize code by reducing unnecessary shuffles, using efficient transformations, and avoiding unnecessary data movements
Utilize caching to store intermediate results in memory for faster access
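A minimal sketch of such tuning through the session builder; the paths and values are illustrative, not recommendations:

```python
from pyspark.sql import SparkSession

# A few commonly tuned settings; exact values depend on cluster size and workload
spark = (
    SparkSession.builder.appName("perf-tuning")
    .config("spark.sql.shuffle.partitions", "400")   # match shuffle parallelism to data volume
    .config("spark.sql.adaptive.enabled", "true")    # let AQE coalesce partitions and handle skew
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

df = spark.read.parquet("/data/events")  # hypothetical path
df.cache()                               # reuse intermediate results across multiple actions
df.count()
```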
Q88. How to filter data from A dashboard to B dashboard?
Use data connectors or APIs to extract and transfer data from one dashboard to another.
Utilize data connectors or APIs provided by the dashboard platforms to extract data from A dashboard.
Transform the data as needed to match the format of B dashboard.
Use data connectors or APIs of B dashboard to transfer the filtered data from A dashboard to B dashboard.
Q89. What are some methods for optimizing Spark performance?
Optimizing Spark performance involves tuning configurations, partitioning data, caching, and using efficient transformations.
Tune Spark configurations for memory allocation, parallelism, and resource management.
Partition data properly to distribute work evenly across nodes and minimize shuffling.
Cache intermediate results in memory to avoid recomputation.
Use efficient transformations like map, filter, and reduceByKey instead of costly operations like groupByKey.
Opt for columnar file formats such as Parquet to reduce I/O.
Q90. Select the country from an address field separated by commas, where the position of the country within each address may differ.
Use string manipulation to extract country from address field separated by comma.
Split the address field by comma to get individual components
Check each component for country name using a list of countries or regular expressions
Return the country name once found
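A sketch of the split-and-match approach, using a hypothetical country reference list and sample addresses:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, split, trim

spark = SparkSession.builder.appName("extract-country").getOrCreate()

# Reference list of countries (in practice this could come from a lookup table)
countries = ["India", "USA", "Germany"]

df = spark.createDataFrame(
    [("12 MG Road, Bengaluru, India",),
     ("Germany, Hauptstrasse 5, Berlin",),
     ("450 5th Ave, New York, USA",)],
    ["address"],
)

extracted = (
    df.withColumn("part", explode(split(col("address"), ",")))   # one row per component
      .withColumn("part", trim(col("part")))                     # strip surrounding spaces
      .filter(col("part").isin(countries))                       # keep only country matches
      .select("address", col("part").alias("country"))
)
extracted.show(truncate=False)
```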
Q91. Do you have hands-on experience with big data tools?
Yes, I have hands-on experience with big data tools.
I have worked extensively with Hadoop, Spark, and Kafka.
I have experience with data ingestion, processing, and storage using these tools.
I have also worked with NoSQL databases like Cassandra and MongoDB.
I am familiar with data warehousing concepts and have worked with tools like Redshift and Snowflake.
Q92. Explain a batch data pipeline that you have built.
Built a batch datapipeline to process and analyze customer transaction data.
Used Apache Spark for distributed data processing
Ingested data from various sources like databases and files
Performed data cleaning, transformation, and aggregation
Utilized SQL for querying and analyzing data
Generated reports and visualizations for stakeholders
Q93. Describe the SSO process between Snowflake and Azure Active Directory.
SSO process between Snowflake and Azure Active Directory involves configuring SAML-based authentication.
Configure Snowflake to use SAML authentication with Azure AD as the identity provider
Set up a trust relationship between Snowflake and Azure AD
Users authenticate through Azure AD and are granted access to Snowflake resources
SSO eliminates the need for separate logins and passwords for Snowflake and Azure AD
Q94. How much data can be stored in MySQL database?
The maximum amount of data that can be stored in a MySQL database depends on various factors.
The maximum size of a MySQL database is determined by the file system and operating system limitations.
The maximum size of a single table in MySQL is 64 terabytes (TB) for InnoDB storage engine and 256 terabytes (TB) for MyISAM storage engine.
There is no fixed row-count limit; the practical number of rows is bounded by the maximum table size and storage engine settings (and by the range of any AUTO_INCREMENT key).
The maximum size of a row in MySQL is 65,535 bytes, excluding BLOB and TEXT columns, which are stored separately.
Q95. Optimizations performed in your current project
Implemented query optimization techniques to improve performance in data processing.
Utilized indexing to speed up data retrieval
Optimized SQL queries by rewriting them for better performance
Implemented caching mechanisms to reduce redundant data processing
Q96. Can you define the difference between Azure Data Lake and Delta Lake?
Azure Data Lake is a cloud-based storage and analytics service, while Delta Lake is an open-source storage layer that adds reliability to data lakes.
Azure Data Lake is a service provided by Microsoft Azure for storing and analyzing large amounts of data.
Delta Lake is an open-source storage layer that adds ACID transactions and schema enforcement to data lakes.
Azure Data Lake is a cloud-based solution, while Delta Lake can be used on-premises or in the cloud.
Azure Data Lake supports integration with Azure analytics services, while Delta Lake is typically layered on top of existing data lake storage such as ADLS or S3.
Q97. SQL question: 1. Find the top 3 products for every producer. 2. Create a sample stored procedure.
Use SQL to find the top 3 products for every producer and create a sample stored procedure.
Use a SQL query with a window function to rank products within each producer.
Partition the data by producer and order by product sales to rank the products.
Limit the results to the top 3 products for each producer.
Create a stored procedure that encapsulates the SQL logic for easy reuse.
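A sketch of the window-function query over a hypothetical products table (the stored-procedure half is engine-specific, e.g. T-SQL, and not shown here):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("top3-per-producer").getOrCreate()

# Hypothetical sales data: (producer, product, sales)
spark.createDataFrame(
    [("Acme", "A", 100), ("Acme", "B", 90), ("Acme", "C", 80), ("Acme", "D", 10),
     ("Zen", "X", 70), ("Zen", "Y", 60)],
    ["producer", "product", "sales"],
).createOrReplaceTempView("products")

# Rank products within each producer and keep the top 3
spark.sql("""
    SELECT producer, product, sales
    FROM (
        SELECT *,
               DENSE_RANK() OVER (PARTITION BY producer ORDER BY sales DESC) AS rnk
        FROM products
    ) ranked
    WHERE rnk <= 3
""").show()
```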
Q98. 1. Types of cloud 2. Different activities in ADF 3. Advanced SQL 4. Basic OOP concepts
Questions related to cloud types, ADF activities, advanced SQL, and basic OOP concepts.
Types of cloud include public, private, and hybrid
ADF activities include Copy, Lookup, Data Flow, and control-flow activities such as ForEach and Execute Pipeline
Advanced SQL includes window functions, subqueries, and joins
Basic OOP concepts include encapsulation, inheritance, and polymorphism
Q99. Time travel, the different types of tables in Snowflake, and their retention periods
Snowflake has permanent, transient, and temporary tables, each with different Time Travel retention; Time Travel allows querying historical data.
Snowflake table types include permanent, transient, and temporary tables.
Transient and temporary tables support at most 1 day of Time Travel (1 day is also the default).
Permanent tables default to 1 day of Time Travel, extendable up to 90 days on Enterprise edition and above, plus 7 days of Fail-safe.
Time Travel lets you query data as it existed at an earlier point in time, e.g. SELECT * FROM t AT (OFFSET => -3600).
Q100. Configure a cluster for 100 TB of data
To configure a cluster for 100 TB data, consider factors like storage capacity, processing power, network bandwidth, and fault tolerance.
Choose a distributed storage system like HDFS or Amazon S3 for scalability and fault tolerance.
Select high-capacity servers with sufficient RAM and CPU for processing large volumes of data.
Ensure high-speed network connections between nodes to facilitate data transfer.
Implement data replication and backup strategies to prevent data loss.
Consider replication overhead (for example, a 3x replication factor in HDFS) and future data growth when sizing total capacity.