Senior Data Engineer

200+ Senior Data Engineer Interview Questions and Answers

Updated 15 Jan 2025

Q1. Write a query to get the customer with the highest total order value for each year, month. [Note: Order table is different and Customer table is different. Order_ID and Customer_ID are the PK of the table with...

Ans.

Query to get the customer with the highest total order value for each year, month.

  • Join the Order and Customer tables on the foreign key

  • Group the results by year, month, and customer

  • Calculate the total order value for each group

  • Find the maximum total order value for each year, month

  • If there are multiple customers with the same highest total order value, select the one with the lower Customer_ID
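
A minimal SQL sketch of this approach, run here through spark.sql; the table names Orders and Customers and the columns Order_Date, Order_Value and Customer_Name are assumptions, since the full schema is not given in the question:

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("top_customer_per_month").getOrCreate()

  top_customer_sql = """
  SELECT year, month, Customer_ID, Customer_Name, total_order_value
  FROM (
      SELECT YEAR(o.Order_Date)  AS year,
             MONTH(o.Order_Date) AS month,
             c.Customer_ID,
             c.Customer_Name,
             SUM(o.Order_Value)  AS total_order_value,
             ROW_NUMBER() OVER (
                 PARTITION BY YEAR(o.Order_Date), MONTH(o.Order_Date)
                 ORDER BY SUM(o.Order_Value) DESC, c.Customer_ID ASC  -- tie-break on lower Customer_ID
             ) AS rn
      FROM Orders o
      JOIN Customers c ON o.Customer_ID = c.Customer_ID
      GROUP BY YEAR(o.Order_Date), MONTH(o.Order_Date), c.Customer_ID, c.Customer_Name
  ) ranked
  WHERE rn = 1
  """
  spark.sql(top_customer_sql).show()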

Q2. There are 10 million records in the table and the schema does not contain the ModifiedDate column. One cell was modified the next day in the table. How will you fetch that particular information that needs to b...

Ans.

To fetch the modified information from a 10-million-record table that has no ModifiedDate column.

  • Create a trigger to capture the modified information and insert it into a separate table with ModifiedDate column.

  • Use a tool like Change Data Capture (CDC) to track changes in the table and extract the modified information.

  • Use a query to compare the current table with a backup copy taken the previous day to identify the modified information.

Senior Data Engineer Interview Questions and Answers for Freshers

Q3. How do you handle data pipeline when the schema information keeps changing at the source?

Ans.

Handle changing schema by using schema evolution techniques and version control.

  • Use schema evolution techniques like adding new fields, renaming fields, and changing data types.

  • Implement version control to track changes and ensure backward compatibility.

  • Use tools like Apache Avro or Apache Parquet to store data in a self-describing format.

  • Implement automated testing to ensure data quality and consistency.

  • Collaborate with data producers to establish clear communication and doc...

Q4. Difference between Parquet and ORC file. Why industry uses parquet over ORC? Can schema evolution happen in ORC?

Ans.

Parquet and ORC are both columnar storage formats. Parquet is preferred largely for its broad cross-platform and ecosystem support, and ORC does support schema evolution.

  • Parquet is a columnar storage format that is widely used in the industry because it is supported by most engines and languages (Spark, Hive, Impala, Presto/Trino, Python, Java) and compresses well.

  • ORC is another columnar storage format, originally optimized for Hive; schema evolution can happen in ORC, since columns can be added, dropped, or renamed, and it often compresses as well as or better than Parquet.

  • The industry favours Parquet mainly for this wider tool and language support rather than for compression alone.

  • ORC is pre...

Q5. What is Normalisation and Denormalisation? When do we use them? Give a real-time example that is implemented in your project.

Ans.

Normalisation is the process of organizing data in a database to reduce redundancy and improve data integrity. Denormalisation is the opposite process.

  • Normalisation is used to eliminate data redundancy and improve data integrity.

  • Denormalisation is used to improve query performance by reducing the number of joins required.

  • A real-time example of normalisation is breaking down a customer's information into separate tables such as customer details, order details, and payment deta...

Q6. Given infinite coins of some currency of denominations 1, 2, 5, 10, in how many unique distinct ways can we obtain a total amount of, say, 25? And how do we do the same for non-unique repeated combinations or p...

Ans.

There are multiple ways to obtain a total amount of 25 using coins of denominations 1, 2, 5, and 10. The question asks for unique distinct ways and non-unique repeated combinations or permutations.

  • For unique distinct ways, you can use dynamic programming to calculate the number of ways to reach the target amount.

  • For non-unique repeated combinations or permutations, you can use a recursive function to generate all possible combinations.

  • Example for unique distinct ways: [1, 2, ...
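
A short Python sketch of both counts for denominations [1, 2, 5, 10] and a target of 25; count_combinations treats {10, 10, 5} as a single way regardless of order, while count_permutations counts every ordering separately:

  def count_combinations(coins, target):
      # dp[a] = number of unordered coin combinations that sum to a.
      dp = [0] * (target + 1)
      dp[0] = 1
      for coin in coins:                      # fixing the coin order avoids counting permutations
          for amount in range(coin, target + 1):
              dp[amount] += dp[amount - coin]
      return dp[target]

  def count_permutations(coins, target):
      # dp[a] = number of ordered sequences of coins that sum to a.
      dp = [0] * (target + 1)
      dp[0] = 1
      for amount in range(1, target + 1):
          for coin in coins:
              if coin <= amount:
                  dp[amount] += dp[amount - coin]
      return dp[target]

  print(count_combinations([1, 2, 5, 10], 25))   # unique, order-insensitive ways
  print(count_permutations([1, 2, 5, 10], 25))   # order-sensitive ways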

Q7. What are the different types of schema you know in Data Warehousing?

Ans.

There are three types of schema in Data Warehousing: Star Schema, Snowflake Schema, and Fact Constellation Schema.

  • Star Schema: central fact table connected to dimension tables in a star shape

  • Snowflake Schema: extension of star schema with normalized dimension tables

  • Fact Constellation Schema: multiple fact tables connected to dimension tables in a complex structure

Q8. Given a non-decreasing array, how can I determine the indices of an element X within it? If the element is not present, the output should be [-1, -1]. For example, for the array [1,2,3,3,5,5,7,8] and X=5, the e...

Ans.

Find the first and last indices of an element X in a non-decreasing array.

  • Scan the array (or binary-search it, since it is sorted) and record the first and last positions where X appears

  • Return [first_index, last_index], or [-1, -1] if X is not found

  • Handle edge cases like an empty array or X not being present (see the sketch below)
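
Because the array is sorted, the first and last positions can also be found with binary search rather than a full scan; a sketch using Python's bisect module:

  from bisect import bisect_left, bisect_right

  def search_range(nums, x):
      left = bisect_left(nums, x)              # first position where x could be inserted
      if left == len(nums) or nums[left] != x:
          return [-1, -1]                      # x not present (also covers an empty array)
      right = bisect_right(nums, x) - 1        # last occurrence of x
      return [left, right]

  print(search_range([1, 2, 3, 3, 5, 5, 7, 8], 5))   # [4, 5]
  print(search_range([1, 2, 3, 3, 5, 5, 7, 8], 6))   # [-1, -1]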

Q9. How many stages will be created if a spark job has 3 wide transformations and 2 narrow transformations?

Ans.

There will be 4 stages created in total for the spark job.

  • Wide transformations trigger a shuffle and create a new stage.

  • Narrow transformations do not trigger a shuffle and do not create a new stage.

  • In this case, 3 wide transformations will create 3 new stages and 2 narrow transformations will not create new stages.

  • Therefore, a total of 4 stages will be created.

Q10. What is the best approach to finding whether the data frame is empty or not?

Ans.

Use the len() function to check the length of the data frame.

  • Use len() function to get the number of rows in the data frame.

  • If the length is 0, then the data frame is empty.

  • Example: if len(df) == 0: print('Data frame is empty')
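
The bullets above assume a pandas data frame; for a Spark DataFrame, counting every row just to test emptiness is wasteful. A brief sketch of both, where df.isEmpty() assumes Spark 3.3 or later:

  import pandas as pd

  pdf = pd.DataFrame()
  print(pdf.empty)            # pandas: True when there are no rows
  print(len(pdf) == 0)        # equivalent length-based check

  # PySpark (assuming an existing DataFrame named sdf):
  # sdf.isEmpty()             # Spark 3.3+
  # len(sdf.head(1)) == 0     # cheap check that works on older Spark versions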

Q11. How do you calculate the resources based on the number of cores and memory given, e.g. 16 cores with 64 GB? What is overhead, driver memory, etc.?

Ans.

Calculate the number of executors, executor cores, executor memory, overhead and driver memory from the given cores and memory (see the sketch below).

  • Start from the node's totals (e.g. 16 cores and 64 GB) and reserve roughly 1 core and 1 GB for the operating system and Hadoop/YARN daemons

  • Pick the cores per executor (around 5 is a common choice) and derive the number of executors per node from the remaining cores

  • Split the remaining memory across those executors, then subtract the executor memory overhead (typically max(384 MB, ~10% of executor memory))

  • Size the driver memory separately based on the workload, for example how much data is collected back to the driver

  • Example: For 16 cores with 64 GB memory, assuming 1 GB overhead and 2 GB driver memor...
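
A rough back-of-the-envelope sketch for a 16-core / 64 GB node, following the common rule of thumb of reserving 1 core and 1 GB for the OS and daemons, using about 5 cores per executor, and subtracting roughly 10% (minimum 384 MB) as executor memory overhead; the exact figures are illustrative assumptions rather than fixed rules:

  cores_per_node = 16
  memory_per_node_gb = 64

  usable_cores = cores_per_node - 1             # leave 1 core for OS / daemons
  usable_memory_gb = memory_per_node_gb - 1     # leave 1 GB for OS / daemons

  executor_cores = 5                            # common sweet spot for HDFS throughput
  executors_per_node = usable_cores // executor_cores              # 3
  memory_per_executor_gb = usable_memory_gb / executors_per_node   # 21 GB

  overhead_gb = max(0.384, 0.10 * memory_per_executor_gb)          # spark.executor.memoryOverhead
  executor_memory_gb = memory_per_executor_gb - overhead_gb        # ~19 GB for --executor-memory

  print(executors_per_node, executor_cores, round(executor_memory_gb), round(overhead_gb, 1))
  # Driver memory is sized separately, e.g. 2-4 GB or more depending on how much data the driver collects.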

Q12. Big data: Hadoop architecture and HDFS commands to copy and list files in HDFS; Spark architecture; Transformations and Actions; what happens when we submit a Spark program; Spark DataFrame coding question; s...

Ans.

Questions on big data, Hadoop, Spark, Scala, Git, project and Agile.

  • Hadoop architecture and HDFS commands for copying and listing files in HDFS

  • Spark architecture and Transformation and Action question

  • What happens when we submit a Spark program

  • Spark DataFrame coding question

  • Scala basic program on List

  • Git and Github

  • Project-related question

  • Agile-related

Q13. Explain Transaction Isolation. What are the various types of transaction isolation in RDBMS?

Ans.

Transaction isolation is a concept in databases that ensures transactions are executed independently of each other.

  • Transaction isolation levels determine the degree to which one transaction must be isolated from other transactions.

  • Types of transaction isolation levels include Read Uncommitted, Read Committed, Repeatable Read, and Serializable.

  • Each isolation level offers a different level of consistency and concurrency control.

  • For example, in Read Uncommitted isolation level, ...

Q14. How would you pass a connection string if a Lambda function is connecting to a database?

Ans.

Pass connection string as environment variable or use AWS Secrets Manager

  • Store connection string as environment variable in Lambda function configuration

  • Retrieve connection string from AWS Secrets Manager and use it in Lambda function

  • Use IAM role to grant Lambda function access to database

  • Encrypt connection string using AWS KMS for added security
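
A minimal Lambda sketch of the two options above; DB_CONN_STRING and the secret name my-app/db are placeholder names, while the boto3 Secrets Manager call itself is the standard API:

  import json
  import os

  import boto3

  def get_connection_string():
      # Option 1: plain environment variable set in the Lambda configuration.
      conn = os.environ.get("DB_CONN_STRING")
      if conn:
          return conn

      # Option 2: fetch from AWS Secrets Manager
      # (the function's IAM role must allow secretsmanager:GetSecretValue).
      client = boto3.client("secretsmanager")
      secret = client.get_secret_value(SecretId="my-app/db")
      return json.loads(secret["SecretString"])["connection_string"]

  def lambda_handler(event, context):
      conn_string = get_connection_string()
      # ... open the database connection with conn_string here ...
      return {"statusCode": 200}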

Q15. What is the difference between tasks and stages? What about the Spark UI?

Ans.

Tasks and stages are components of the execution plan in Spark UI.

  • Tasks are the smallest unit of work in Spark, representing a single operation on a partition of data.

  • Stages are groups of tasks that are executed together as part of a larger computation.

  • Tasks within a stage can be executed in parallel, while stages are executed sequentially.

  • Tasks are created based on the transformations and actions in the Spark application.

  • Stages are created based on the dependencies between R...

Q16. How would you migrate 1000s of tables using Spark (Databricks) notebooks?

Ans.

Use Spark (Databricks) notebooks to migrate 1000s of tables efficiently.

  • Utilize Spark's parallel processing capabilities to handle large volumes of data

  • Leverage Databricks notebooks for interactive data exploration and transformation

  • Automate the migration process using scripts or workflows

  • Optimize performance by tuning Spark configurations and cluster settings

Q17. What is the difference between repartition and coalesce?

Ans.

Repartition increases or decreases the number of partitions in a DataFrame, while Coalesce only decreases the number of partitions.

  • Repartition can increase or decrease the number of partitions in a DataFrame, leading to a shuffle of data across the cluster.

  • Coalesce only decreases the number of partitions in a DataFrame without performing a full shuffle, making it more efficient than repartition.

  • Repartition is typically used when there is a need to increase the number of parti...

Q18. Python coding question (without using built-in Python methods): 1. check if a list is sorted; 2. sort the list, and optimize the solution.

Ans.

Check if a list is sorted and sort the list without using Python methods.

  • To check if a list is sorted, iterate through the list and compare each element with the next one. If any element is greater than the next one, the list is not sorted.

  • To sort the list without using Python methods, implement a sorting algorithm like bubble sort, selection sort, or insertion sort.

  • Example for checking if a list is sorted: ['a', 'b', 'c'] is sorted, ['c', 'b', 'a'] is not sorted.

  • Example for ...
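
A compact Python sketch of both parts without using built-in sorting helpers; insertion sort is shown here as one simple choice:

  def is_sorted(items):
      # Any element greater than its successor means the list is not sorted.
      for i in range(len(items) - 1):
          if items[i] > items[i + 1]:
              return False
      return True

  def insertion_sort(items):
      # In-place insertion sort: O(n^2) worst case, close to O(n) when nearly sorted.
      for i in range(1, len(items)):
          key = items[i]
          j = i - 1
          while j >= 0 and items[j] > key:
              items[j + 1] = items[j]
              j -= 1
          items[j + 1] = key
      return items

  print(is_sorted(['a', 'b', 'c']))        # True
  print(insertion_sort(['c', 'b', 'a']))   # ['a', 'b', 'c']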

Q19. What are optimisation techniques used in the project?

Ans.

Optimisation techniques used in the project include indexing, query optimization, caching, and parallel processing.

  • Indexing: Creating indexes on frequently queried columns to improve search performance.

  • Query optimization: Rewriting queries to make them more efficient and reduce execution time.

  • Caching: Storing frequently accessed data in memory to reduce the need for repeated database queries.

  • Parallel processing: Distributing tasks across multiple processors to speed up data p...

Q20. what are the optimization techniques used in your project

Ans.

Optimization techniques used in project

  • Caching

  • Parallel processing

  • Compression

  • Indexing

  • Query optimization

Q21. Which AWS services did you use, and what was the AWS architecture for those services?

Ans.

AWS services used include S3, Redshift, Glue, EMR, and Lambda in a scalable and cost-effective architecture.

  • AWS S3 for storing large amounts of data

  • AWS Redshift for data warehousing and analytics

  • AWS Glue for ETL processes

  • AWS EMR for big data processing

  • AWS Lambda for serverless computing

Q22. How does query acceleration speed up query processing?

Ans.

Query acceleration speeds up query processing by optimizing query execution and reducing the time taken to retrieve data.

  • Query acceleration uses techniques like indexing, partitioning, and caching to optimize query execution.

  • It reduces the time taken to retrieve data by minimizing disk I/O and utilizing in-memory processing.

  • Examples include using columnar storage formats like Parquet or optimizing join operations.

Q23. Difference between Broadcast variable and accumulator variable

Ans.

Broadcast variables are read-only variables cached on each worker node, while accumulators are variables that tasks can only add to and whose value is read back on the driver.

  • Broadcast variables are used to give every node its own copy of a small lookup table or other shared read-only data, instead of shipping it with every task.

  • Accumulator variables are used to keep a running total (counters, sums) across multiple tasks; executors can only update them and only the driver can read the result.

  • Broadcast variables are used for read-only operations while accumulator variables are used for writ...
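
A small PySpark sketch contrasting the two; the country lookup table and the bad-record counter are illustrative names:

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("broadcast_vs_accumulator").getOrCreate()
  sc = spark.sparkContext

  country_lookup = sc.broadcast({"IN": "India", "US": "United States"})  # read-only copy on each executor
  bad_records = sc.accumulator(0)                                        # tasks add to it, driver reads it

  def enrich(code):
      if code not in country_lookup.value:
          bad_records.add(1)               # accumulate across tasks
          return "Unknown"
      return country_lookup.value[code]

  result = sc.parallelize(["IN", "US", "XX"]).map(enrich).collect()
  print(result)             # ['India', 'United States', 'Unknown']
  print(bad_records.value)  # value is available on the driver after the action has run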

Q24. 1. Different types of integration runtime in ADF. 2. How to copy 100 files from one ADLS path to another in ADF. 3. Difference between DAG and lineage, and between narrow and wide transformations in Spark. 4. DBUtils questions. 5. ...

Ans.

The interview questions cover topics related to Azure Data Factory, Spark, and Python programming.

  • Integration runtimes in ADF include Azure, Self-hosted, and SSIS IRs.

  • To copy 100 files in ADF, use a Copy Data activity with a wildcard path in source and sink datasets.

  • DAG in Spark represents a directed acyclic graph of computation, while lineage tracks the data flow.

  • Narrow transformations in Spark operate on a single partition, wide transformations shuffle data across partition...

Q25. Given a sorted array of integers, if each element is squared, provide the sorted array of squared elements in the least possible time complexity.

Ans.

Given a sorted array of integers, return the sorted array of squared elements in the least time complexity, i.e. O(n) with two pointers.

  • Create a new result array of the same length to store the squared elements

  • Use two pointers that move inwards from both ends of the original array; because of negative values, the largest square is always at one of the two ends

  • Compare the squared values at the two pointers and fill the new array from the back (largest first), so the finished array is in ascending order
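
A Python sketch of the two-pointer approach described above; it runs in O(n) and fills the result from the back so the finished array is ascending:

  def sorted_squares(nums):
      n = len(nums)
      result = [0] * n
      left, right = 0, n - 1
      # The largest square is always at one of the two ends of a sorted array.
      for pos in range(n - 1, -1, -1):
          if abs(nums[left]) > abs(nums[right]):
              result[pos] = nums[left] ** 2
              left += 1
          else:
              result[pos] = nums[right] ** 2
              right -= 1
      return result

  print(sorted_squares([-7, -3, 0, 2, 5]))   # [0, 4, 9, 25, 49]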

Q26. Two SQL and two Python coding questions, such as reversing a string.

Ans.

Reverse a string using SQL and Python codes.

  • In SQL, use the REVERSE function to reverse a string.

  • In Python, use slicing with a step of -1 to reverse a string.

Q27. What is the SQL query to group by employee ID in order to combine the first name and last name with a space?

Ans.

SQL query to group by employee ID and combine first name and last name with a space

  • Use the GROUP BY clause to group by employee ID

  • Use the CONCAT function to combine first name and last name with a space

  • Select employee ID, CONCAT(first_name, ' ', last_name) AS full_name

Q28. How do you decide on cores and worker nodes?

Ans.

Cores and worker nodes are decided based on the workload requirements and scalability needs of the data processing system.

  • Consider the size and complexity of the data being processed

  • Evaluate the processing speed and memory requirements of the tasks

  • Take into account the parallelism and concurrency needed for efficient data processing

  • Monitor the system performance and adjust cores and worker nodes as needed

Q29. SQL query to find the 2nd most ordered item in a category

Ans.

Use a SQL query with a subquery to find the 2nd most ordered item in a category.

  • Use a subquery to rank items within each category based on the number of orders

  • Select the item with rank 2 within each category

  • Order the results by category and rank to get the 2nd most ordered item in each category
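
A hedged SQL sketch of this ranking approach, assuming a hypothetical order_items(category, item, quantity) table; DENSE_RANK keeps tied items on the same rank:

  second_most_ordered_sql = """
  SELECT category, item, total_orders
  FROM (
      SELECT category,
             item,
             SUM(quantity) AS total_orders,
             DENSE_RANK() OVER (
                 PARTITION BY category
                 ORDER BY SUM(quantity) DESC
             ) AS rnk
      FROM order_items
      GROUP BY category, item
  ) ranked
  WHERE rnk = 2
  ORDER BY category
  """
  # spark.sql(second_most_ordered_sql).show()   # assuming an active SparkSession named spark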

Q30. Find top 5 countries with highest population in Spark and SQL

Ans.

Use Spark and SQL to find the top 5 countries with the highest population.

  • Use Spark to load the data and perform data processing.

  • Use SQL queries to group by country and sum the population.

  • Order the results in descending order and limit to top 5.

  • Example: SELECT country, SUM(population) AS total_population FROM table_name GROUP BY country ORDER BY total_population DESC LIMIT 5
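
The DataFrame-API equivalent of the SQL above, sketched with a hypothetical CSV file that has country and population columns:

  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F

  spark = SparkSession.builder.appName("top_countries").getOrCreate()

  df = spark.read.csv("population.csv", header=True, inferSchema=True)   # hypothetical input path

  top5 = (df.groupBy("country")
            .agg(F.sum("population").alias("total_population"))
            .orderBy(F.desc("total_population"))
            .limit(5))
  top5.show()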

Q31. What is a catalyst optimiser? How does it work?

Ans.

A catalyst optimizer is a query optimization tool used in Apache Spark to improve performance by generating an optimal query plan.

  • Catalyst optimizer is a rule-based query optimization framework in Apache Spark.

  • It leverages rules to transform the logical query plan into a more optimized physical plan.

  • The optimizer applies various optimization techniques like predicate pushdown, constant folding, and join reordering.

  • By optimizing the query plan, it reduces the overall execution...

Q32. What is shuffling? How to Handle Shuffling?

Ans.

Shuffling is the process of redistributing data across partitions in a distributed computing environment.

  • Shuffling is necessary when data needs to be grouped or aggregated across different partitions.

  • It can be handled efficiently by minimizing the amount of data being shuffled and optimizing the partitioning strategy.

  • Techniques like partitioning, combiners, and reducers can help reduce the amount of shuffling in MapReduce jobs.

Q33. What are the different joins in SQL? Please give an example to elaborate.

Ans.

Different types of joins in SQL include inner join, left join, right join, and full outer join.

  • Inner join: Returns rows when there is a match in both tables.

  • Left join: Returns all rows from the left table and the matched rows from the right table.

  • Right join: Returns all rows from the right table and the matched rows from the left table.

  • Full outer join: Returns all rows from both tables, matching rows where possible and filling in NULLs where there is no match.

  • Example: SELECT * FROM table1 INNER JOIN table2 ON table1.id = table2.id;

Q34. How would you design a review analytics dashboard for a client partner?

Ans.

I would design a review analytics dashboard by integrating data from various sources, creating visualizations for key metrics, and allowing for customization and interactivity.

  • Integrate data from various sources such as customer reviews, ratings, and feedback.

  • Create visualizations for key metrics like average rating, sentiment analysis, and review volume over time.

  • Allow for customization by letting users filter and drill down into specific data points.

  • Enable interactivity by ...

Q35. Write a Spark program to find the word with maximum length in a given string

Ans.

Use Spark program to find word with maximum length in a given string

  • Split the string into words using space as delimiter

  • Map each word to its length

  • Find the word with maximum length using reduce operation
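
A small PySpark sketch of this; the input sentence is hard-coded purely for illustration:

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("longest_word").getOrCreate()
  sc = spark.sparkContext

  sentence = "Spark makes distributed data processing straightforward"

  longest = (sc.parallelize(sentence.split(" "))                       # one record per word
               .reduce(lambda a, b: a if len(a) >= len(b) else b))     # keep the longer word
  print(longest)   # straightforward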

Q36. Write an SQL query to change rows into columns and vice versa

Ans.

Use SQL pivot function to change rows into columns and vice versa

  • Use the PIVOT function in SQL to transform rows into columns

  • Use the UNPIVOT function in SQL to transform columns into rows

  • Example: SELECT * FROM table_name PIVOT (SUM(value) FOR column_name IN (value1, value2, value3))

  • Example: SELECT * FROM table_name UNPIVOT (value FOR column_name IN (value1, value2, value3))

Q37. Write an SQL query to fetch duplicate rows in a table

Ans.

SQL query to fetch duplicate rows in a table

  • Use GROUP BY and HAVING clause to identify duplicate rows

  • Select columns to check for duplicates

  • Example: SELECT column1, column2, COUNT(*) FROM table_name GROUP BY column1, column2 HAVING COUNT(*) > 1;

Q38. Find the duplicate items in a list

Ans.

To find duplicate items in a list

  • Iterate through the list and compare each item with the rest of the list

  • Use a hash table to keep track of seen items

  • Sort the list and compare adjacent items
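
A short Python sketch of the hash-based approach, which runs in O(n):

  def find_duplicates(items):
      seen, duplicates = set(), set()
      for item in items:
          if item in seen:
              duplicates.add(item)    # seen at least once before
          else:
              seen.add(item)
      return list(duplicates)

  print(find_duplicates(["a", "b", "a", "c", "b"]))   # ['a', 'b'] (order may vary)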

Q39. What do you know about Spark architecture?

Ans.

Spark architecture is based on a master-slave architecture with a cluster manager to coordinate tasks.

  • Spark architecture consists of a driver program that communicates with a cluster manager to coordinate tasks.

  • The cluster manager allocates resources and schedules tasks on worker nodes.

  • Worker nodes execute the tasks and return results to the driver program.

  • Spark supports various cluster managers like YARN, Mesos, and standalone mode.

  • Spark applications can run in standalone mo...

Q40. What is the purpose of the Spark Submit command in Apache Spark?

Ans.

Spark Submit command is used to submit Spark applications to a cluster.

  • Spark Submit command is used to launch applications on a Spark cluster.

  • It allows users to specify application parameters like main class, jars, and arguments.

  • Users can also configure properties like memory allocation and number of executors.

  • Example: spark-submit --class com.example.Main --master yarn --deploy-mode cluster myApp.jar

Q41. How do you handle Incremental data?

Ans.

Incremental data is handled by identifying new data since the last update and merging it with existing data.

  • Identify new data since last update

  • Merge new data with existing data

  • Update data warehouse or database with incremental changes
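
One common way to apply the merge step is a Delta Lake MERGE INTO statement, sketched below; the target_customers and staging_customers table names and the customer_id / updated_at columns are assumptions:

  merge_sql = """
  MERGE INTO target_customers AS t
  USING staging_customers AS s          -- staging holds only the rows new or changed since the last load
  ON t.customer_id = s.customer_id
  WHEN MATCHED AND s.updated_at > t.updated_at THEN
    UPDATE SET *
  WHEN NOT MATCHED THEN
    INSERT *
  """
  # spark.sql(merge_sql)   # assuming an active SparkSession and Delta Lake tables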

Q42. Using two tables find the different records for different joins

Ans.

To find different records for different joins using two tables

  • Use the SQL query to perform different joins like INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL JOIN

  • Identify the key columns in both tables to join on

  • Select the columns from both tables and use WHERE clause to filter out the different records

Q43. What happens when we enforce schema?

Ans.

Enforcing schema ensures that data conforms to a predefined structure and rules.

  • Ensures data integrity by validating incoming data against predefined schema

  • Helps in maintaining consistency and accuracy of data

  • Prevents data corruption and errors in data processing

  • Can lead to rejection of data that does not adhere to the schema

Q44. What are the common file formats used in data storages? Which one is best for compression?

Ans.

Common file formats used in data storages include CSV, JSON, Parquet, Avro, and ORC. Parquet is best for compression.

  • CSV (Comma-Separated Values) - simple and widely used, but not efficient for large datasets

  • JSON (JavaScript Object Notation) - human-readable and easy to parse, but can be inefficient for storage

  • Parquet - columnar storage format that is highly efficient for compression and query performance

  • Avro - efficient binary format with schema support, good for data serial...

Q45. End-to-end project architecture and how the data pipeline works

Ans.

End to end project architecture involves designing and implementing a data pipeline to process and analyze data from various sources.

  • Define project requirements and goals

  • Design data architecture including data sources, storage, processing, and analytics tools

  • Implement data pipeline to extract, transform, and load data

  • Ensure data quality and consistency throughout the pipeline

  • Monitor and optimize performance of the data pipeline

  • Examples: Using Apache Kafka for real-time data s...

Q46. SQL: what are the conditions used in SQL? When we have a table but we want to create...

Ans.

SQL conditions are used to filter data based on specified criteria. Common conditions include WHERE, AND, OR, IN, BETWEEN, etc.

  • Common SQL conditions include WHERE, AND, OR, IN, BETWEEN, LIKE, etc.

  • Conditions are used to filter data based on specified criteria in SQL queries.

  • Examples: WHERE salary > 50000, AND department = 'IT', OR age < 30

Q47. Tell me about yourself; Projects done; What is Columnar format file in Spark; Internals of Spark, Difference between OLAP and OLTP; About Datawarehouse- facts, dimensions

Ans.

I am a Senior Data Engineer with experience in various projects involving columnar format files in Spark, understanding Spark internals, OLAP vs OLTP, and data warehousing concepts.

  • Projects: Developed ETL pipelines using Spark for processing large datasets, implemented data quality checks, and optimized query performance.

  • Columnar format file in Spark: It stores data in columnar format to improve query performance by reading only the required columns, like Parquet or ORC files...

Q48. What are some of the analytical functions available in SQL?

Ans.

Analytical functions in SQL are used to perform calculations on sets of rows.

  • Aggregate functions like SUM, AVG, COUNT, MIN, MAX

  • Window functions like ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD

  • Ranking functions like NTILE, PERCENT_RANK, CUME_DIST

  • Statistical functions like STDDEV, VARIANCE

  • String functions like CONCAT, SUBSTRING, TRIM

Q49. How to write a file in a delta table?

Ans.

To write a file in a delta table, you can use the Delta Lake API or Spark SQL commands.

  • Use Delta Lake API to write data to a delta table

  • Use Spark SQL commands like INSERT INTO to write data to a delta table

  • Ensure that the data being written is in the correct format and schema
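
A minimal sketch of both options, assuming Delta Lake is available on the cluster; the path /mnt/lake/events and the table name events_delta are placeholders:

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("delta_write").getOrCreate()
  df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "name"])

  # Option 1: DataFrame writer with the delta format (path-based or table-based).
  df.write.format("delta").mode("append").save("/mnt/lake/events")
  # df.write.format("delta").mode("append").saveAsTable("events_delta")

  # Option 2: plain SQL against an existing Delta table.
  df.createOrReplaceTempView("new_events")
  spark.sql("INSERT INTO events_delta SELECT * FROM new_events")   # assumes events_delta already exists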

Q50. What is the difference between a dynamic data frame and a Spark data frame in an AWS Glue job? Can we change a dynamic data frame to a Spark data frame?

Ans.

Dynamic data frame in AWS Glue job is a dynamically generated data frame, while Spark data frame is specifically created using Spark APIs.

  • Dynamic data frame is generated dynamically at runtime based on the data source and schema, while Spark data frame is explicitly created using Spark APIs.

  • Dynamic data frame is more flexible but may have performance implications compared to Spark data frame.

  • You can convert a dynamic data frame to a Spark data frame by explicitly defining the...
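
The conversion itself is a one-liner inside a Glue job; a sketch assuming a hypothetical Glue catalog database sales_db with a table orders:

  from awsglue.context import GlueContext
  from awsglue.dynamicframe import DynamicFrame
  from pyspark.context import SparkContext

  glue_context = GlueContext(SparkContext.getOrCreate())

  dyf = glue_context.create_dynamic_frame.from_catalog(
      database="sales_db", table_name="orders"     # hypothetical catalog names
  )

  spark_df = dyf.toDF()                            # DynamicFrame -> Spark DataFrame
  filtered = spark_df.filter(spark_df["amount"] > 100)

  back_to_dyf = DynamicFrame.fromDF(filtered, glue_context, "filtered_orders")   # and back again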
