Senior Data Engineer

200+ Senior Data Engineer Interview Questions and Answers

Updated 11 Dec 2024

Q1. Write a query to get the customer with the highest total order value for each year, month. [Note: Order table is different and Customer table is different. Order_ID and Customer_ID are the PK of the table with ...

Ans.

Query to get the customer with the highest total order value for each year, month.

  • Join the Order and Customer tables on the foreign key

  • Group the results by year, month, and customer

  • Calculate the total order value for each group

  • Find the maximum total order value for each year, month

  • If there are multiple customers with the same highest total order value, select the one with the lower Customer_ID

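For illustration, a minimal PySpark sketch of the query described above. The table and column names (orders, customers, order_date, order_value, customer_name) are assumptions, since the exact schema is truncated in the question; the ROW_NUMBER tie-break picks the lower Customer_ID as described.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("top_customer_per_month").getOrCreate()

# Assumes the Order and Customer tables are registered as `orders` and `customers`.
top_customer_per_month = spark.sql("""
    WITH monthly_totals AS (
        SELECT YEAR(o.order_date)  AS order_year,
               MONTH(o.order_date) AS order_month,
               c.customer_id,
               c.customer_name,
               SUM(o.order_value)  AS total_order_value
        FROM orders o
        JOIN customers c ON o.customer_id = c.customer_id
        GROUP BY YEAR(o.order_date), MONTH(o.order_date), c.customer_id, c.customer_name
    ),
    ranked AS (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY order_year, order_month
                   ORDER BY total_order_value DESC, customer_id ASC
               ) AS rn
        FROM monthly_totals
    )
    SELECT order_year, order_month, customer_id, customer_name, total_order_value
    FROM ranked
    WHERE rn = 1
""")
top_customer_per_month.show()
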
Q2. There are 10 million records in the table and the schema does not contain the ModifiedDate column. One cell was modified the next day in the table. How will you fetch that particular information that needs to b...

Ans.

To fetch the modified information from a table with 10 million records when the schema has no ModifiedDate column:

  • Create a trigger to capture the modified information and insert it into a separate table with ModifiedDate column.

  • Use a tool like Change Data Capture (CDC) to track changes in the table and extract the modified information.

  • Use a query to compare the current table with a backup copy taken the previous day to identify the modified information.

Senior Data Engineer Interview Questions and Answers for Freshers

Q3. How do you handle data pipeline when the schema information keeps changing at the source?

Ans.

Handle changing schema by using schema evolution techniques and version control.

  • Use schema evolution techniques like adding new fields, renaming fields, and changing data types.

  • Implement version control to track changes and ensure backward compatibility.

  • Use tools like Apache Avro or Apache Parquet to store data in a self-describing format.

  • Implement automated testing to ensure data quality and consistency.

  • Collaborate with data producers to establish clear communication and doc...

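As an illustration of the points above, a hedged PySpark sketch of schema-tolerant reads and writes; the paths are placeholders and mergeSchema is only one of several possible techniques.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema_evolution_demo").getOrCreate()

# Reading Parquet with schema merging: files written with older and newer
# schemas are reconciled into one DataFrame (new columns show up as nulls
# for old records). The path is a placeholder.
df = (
    spark.read
    .option("mergeSchema", "true")
    .parquet("/data/landing/events/")
)
df.printSchema()

# If Delta Lake is available, mergeSchema on write lets new source columns
# be appended without breaking the pipeline (shown commented out):
# (
#     df.write.format("delta")
#     .option("mergeSchema", "true")
#     .mode("append")
#     .save("/data/bronze/events/")
# )
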
Q4. Difference between Parquet and ORC file. Why industry uses parquet over ORC? Can schema evolution happen in ORC?

Ans.

Parquet and ORC are both columnar storage formats. Parquet is generally preferred for its broad cross-platform and ecosystem support; ORC does support schema evolution.

  • Parquet is a columnar storage format that is widely used in the industry because almost every engine and language can read it (Spark, Hive, Presto/Trino, Arrow, pandas).

  • ORC is another columnar storage format, closely tied to the Hive ecosystem, and it supports schema evolution (adding, dropping, and renaming columns).

  • Parquet is usually preferred over ORC for its wider tooling and language support; both compress well, with ORC often achieving slightly better compression on Hive workloads.

  • ORC is pre...

Q5. What is Normalisation and Denormalisation? When do we use them? Give a real-time example that is implemented in your project.

Ans.

Normalisation is the process of organizing data in a database to reduce redundancy and improve data integrity. Denormalisation is the opposite process.

  • Normalisation is used to eliminate data redundancy and improve data integrity.

  • Denormalisation is used to improve query performance by reducing the number of joins required.

  • A real-time example of normalisation is breaking down a customer's information into separate tables such as customer details, order details, and payment deta...

Q6. What are the different types of schema you know in Data Warehousing?

Ans.

There are three types of schema in Data Warehousing: Star Schema, Snowflake Schema, and Fact Constellation Schema.

  • Star Schema: central fact table connected to dimension tables in a star shape

  • Snowflake Schema: extension of star schema with normalized dimension tables

  • Fact Constellation Schema: multiple fact tables connected to dimension tables in a complex structure

Q7. How many stages will be created if a spark job has 3 wide transformations and 2 narrow transformations?

Ans.

There will be 4 stages created in total for the spark job.

  • Wide transformations trigger a shuffle and create a new stage.

  • Narrow transformations do not trigger a shuffle and do not create a new stage.

  • In this case, 3 wide transformations will create 3 new stages and 2 narrow transformations will not create new stages.

  • Therefore, a total of 4 stages will be created.

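A toy PySpark chain illustrating the counting above; the transformations and data are made up, and the exact stage count shown in the Spark UI can vary slightly with adaptive query execution.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stage_count_demo").getOrCreate()
df = spark.range(1_000_000)                     # toy data

result = (
    df.withColumn("bucket", F.col("id") % 10)   # narrow: no shuffle
      .filter(F.col("id") > 100)                # narrow: no shuffle
      .groupBy("bucket").count()                # wide: shuffle boundary 1
      .repartition(4)                           # wide: shuffle boundary 2
      .orderBy("count")                         # wide: shuffle boundary 3
)
result.collect()
# 3 shuffle boundaries split the job into roughly 3 + 1 = 4 stages;
# the narrow steps are pipelined inside existing stages.
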
Q8. What is the best approach to finding whether the data frame is empty or not?

Ans.

For a pandas DataFrame, check len(df) == 0 or df.empty; for a Spark DataFrame, prefer a cheap check such as df.head(1) over a full count().

  • len(df) (or df.empty) inspects the number of rows of a pandas DataFrame without rescanning the data.

  • For a Spark DataFrame, df.count() == 0 scans the whole dataset; len(df.head(1)) == 0 (or df.isEmpty() on recent versions) stops after the first row.

  • Example: if len(df) == 0: print('Data frame is empty')

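A short sketch of both checks; it assumes pandas and PySpark are available, and the DataFrames are illustrative.

import pandas as pd
from pyspark.sql import SparkSession

# pandas: len()/empty are cheap metadata checks
pdf = pd.DataFrame({"a": []})
print(len(pdf) == 0, pdf.empty)          # True True

# Spark: head(1) stops after the first row instead of scanning everything
spark = SparkSession.builder.appName("empty_check_demo").getOrCreate()
sdf = spark.createDataFrame([], "id INT")
print(len(sdf.head(1)) == 0)             # True
# On Spark 3.3+ there is also sdf.isEmpty()
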

Q9. How do you calculate the resources based on the number of cores and memory given, e.g. 16 cores with 64 GB? What is overhead, driver memory, etc.?

Ans.

Resources are sized by dividing the available cores and memory into executors, then reserving overhead and driver memory.

  • Start from the node's total cores and memory (here 16 cores and 64 GB), leaving roughly 1 core and 1 GB for the OS and node daemons.

  • Choose the cores per executor (around 5 is a common rule of thumb) and derive the number of executors from the remaining cores.

  • Split the remaining memory across the executors, then deduct the memory overhead (roughly 7-10% of executor memory) from each executor's allocation.

  • Reserve driver memory separately based on the workload, and account for other services such as Hadoop daemons running on the node.

  • Example: For 16 cores with 64 GB memory, assuming 1 GB overhead and 2 GB driver memor...

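A hedged rule-of-thumb calculation for the 16-core / 64 GB example above; the 5-cores-per-executor and ~10% overhead figures are common conventions, not fixed rules, and should be tuned to the actual workload.

# Rule-of-thumb executor sizing for one node with 16 cores / 64 GB.
total_cores, total_mem_gb = 16, 64

os_cores, os_mem_gb = 1, 1                 # reserved for OS / node daemons
cores_per_executor = 5                     # common convention

usable_cores = total_cores - os_cores
executors_per_node = usable_cores // cores_per_executor             # -> 3

mem_per_executor = (total_mem_gb - os_mem_gb) / executors_per_node  # -> 21 GB
overhead_gb = max(0.384, 0.10 * mem_per_executor)                   # ~2.1 GB
executor_memory_gb = mem_per_executor - overhead_gb                 # ~18.9 GB heap

print(executors_per_node, round(executor_memory_gb, 1), round(overhead_gb, 1))
# Driver memory (e.g. spark.driver.memory=2g) is sized separately, based on
# how much data is collected or broadcast from the driver.
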
Q10. Big data Hadoop architecture and HDFS commands to copy and list files in hdfs spark architecture and Transformation and Action question what happen when we submit spark program spark dataframe coding question s...

Ans.

Questions on big data, Hadoop, Spark, Scala, Git, project and Agile.

  • Hadoop architecture and HDFS commands for copying and listing files in HDFS

  • Spark architecture and Transformation and Action question

  • What happens when we submit a Spark program

  • Spark DataFrame coding question

  • Scala basic program on List

  • Git and Github

  • Project-related question

  • Agile-related

Q11. What is the difference between tasks and stages? What about the Spark UI?

Ans.

Tasks and stages are components of the execution plan in Spark UI.

  • Tasks are the smallest unit of work in Spark, representing a single operation on a partition of data.

  • Stages are groups of tasks that are executed together as part of a larger computation.

  • Tasks within a stage can be executed in parallel, while stages that depend on each other are executed sequentially.

  • Tasks are created based on the transformations and actions in the Spark application.

  • Stages are created based on the dependencies between R...

Q12. How would you pass the connection string if a Lambda function is connecting to a database?

Ans.

Pass connection string as environment variable or use AWS Secrets Manager

  • Store connection string as environment variable in Lambda function configuration

  • Retrieve connection string from AWS Secrets Manager and use it in Lambda function

  • Use IAM role to grant Lambda function access to database

  • Encrypt connection string using AWS KMS for added security

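A minimal sketch of both options described above; it assumes boto3 is available to the Lambda, and the environment variable name and secret id are hypothetical.

import json
import os

import boto3

# Option 1: connection string injected as a Lambda environment variable
# (the variable name DB_CONN_STRING is an assumption).
conn_string = os.environ.get("DB_CONN_STRING")

# Option 2: fetch it from AWS Secrets Manager; the Lambda's IAM role must
# allow secretsmanager:GetSecretValue on this secret.
def get_conn_string(secret_id: str = "prod/db-connection") -> str:
    client = boto3.client("secretsmanager")
    secret = client.get_secret_value(SecretId=secret_id)
    return json.loads(secret["SecretString"])["connection_string"]

def handler(event, context):
    conn = conn_string or get_conn_string()
    # ... open the database connection with `conn` here ...
    return {"statusCode": 200}
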
Q13. How do you migrate 1000s of tables using Spark (Databricks) notebooks?

Ans.

Use Spark (Databricks) notebooks to migrate 1000s of tables efficiently.

  • Utilize Spark's parallel processing capabilities to handle large volumes of data

  • Leverage Databricks notebooks for interactive data exploration and transformation

  • Automate the migration process using scripts or workflows

  • Optimize performance by tuning Spark configurations and cluster settings

Q14. What is the difference between repartition and coalesce?

Ans.

Repartition increases or decreases the number of partitions in a DataFrame, while Coalesce only decreases the number of partitions.

  • Repartition can increase or decrease the number of partitions in a DataFrame, leading to a shuffle of data across the cluster.

  • Coalesce only decreases the number of partitions in a DataFrame without performing a full shuffle, making it more efficient than repartition.

  • Repartition is typically used when there is a need to increase the number of parti...

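A small PySpark illustration of the difference described above.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition_vs_coalesce").getOrCreate()
df = spark.range(1_000_000)

print(df.rdd.getNumPartitions())          # default partition count

wider = df.repartition(200)               # full shuffle; can increase or decrease
print(wider.rdd.getNumPartitions())       # 200

narrower = wider.coalesce(10)             # merges partitions, avoids a full shuffle
print(narrower.rdd.getNumPartitions())    # 10

# coalesce(200) on a 10-partition DataFrame would not increase the count;
# use repartition when you need more partitions or an even data distribution.
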
Q15. Python coding question, without using built-in Python methods: 1. check if a list is sorted; 2. sort the list, and optimize the solution.

Ans.

Check if a list is sorted and sort the list without using Python methods.

  • To check if a list is sorted, iterate through the list and compare each element with the next one. If any element is greater than the next one, the list is not sorted.

  • To sort the list without using Python methods, implement a sorting algorithm like bubble sort, selection sort, or insertion sort.

  • Example for checking if a list is sorted: ['a', 'b', 'c'] is sorted, ['c', 'b', 'a'] is not sorted.

  • Example for ...

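A plain-Python sketch of both parts, avoiding built-in helpers such as sorted().

def is_sorted(items):
    # Compare each element with the next one; any inversion means not sorted.
    for i in range(len(items) - 1):
        if items[i] > items[i + 1]:
            return False
    return True

def insertion_sort(items):
    # O(n^2) in the worst case but O(n) on already-sorted input, which is
    # the "optimize the solution" angle the question hints at.
    result = list(items)
    for i in range(1, len(result)):
        key = result[i]
        j = i - 1
        while j >= 0 and result[j] > key:
            result[j + 1] = result[j]
            j -= 1
        result[j + 1] = key
    return result

print(is_sorted(['a', 'b', 'c']), is_sorted(['c', 'b', 'a']))   # True False
print(insertion_sort([3, 1, 2]))                                # [1, 2, 3]
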
Q16. What are optimisation techniques used in the project?

Ans.

Optimisation techniques used in the project include indexing, query optimization, caching, and parallel processing.

  • Indexing: Creating indexes on frequently queried columns to improve search performance.

  • Query optimization: Rewriting queries to make them more efficient and reduce execution time.

  • Caching: Storing frequently accessed data in memory to reduce the need for repeated database queries.

  • Parallel processing: Distributing tasks across multiple processors to speed up data p...

Q17. What are the optimization techniques used in your project?

Ans.

Optimization techniques used in project

  • Caching

  • Parallel processing

  • Compression

  • Indexing

  • Query optimization

Q18. Which AWS services were used, and what was the AWS architecture for those services?

Ans.

AWS services used include S3, Redshift, Glue, EMR, and Lambda in a scalable and cost-effective architecture.

  • AWS S3 for storing large amounts of data

  • AWS Redshift for data warehousing and analytics

  • AWS Glue for ETL processes

  • AWS EMR for big data processing

  • AWS Lambda for serverless computing

Q19. How does query acceleration speed up query processing?

Ans.

Query acceleration speeds up query processing by optimizing query execution and reducing the time taken to retrieve data.

  • Query acceleration uses techniques like indexing, partitioning, and caching to optimize query execution.

  • It reduces the time taken to retrieve data by minimizing disk I/O and utilizing in-memory processing.

  • Examples include using columnar storage formats like Parquet or optimizing join operations.

Q20. Difference between Broadcast variable and accumulator variable

Ans.

Broadcast variables are read-only variables cached on each worker node, while accumulator variables are write-only from the executors' side and are used to accumulate values across multiple tasks (the driver reads the final result).

  • Broadcast variables are used to give every node a copy of a large input dataset or a small lookup table.

  • Accumulator variables are used to keep a running total of values across multiple tasks.

  • Broadcast variables are used for read-only operations while accumulator variables are used for writ...

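A minimal PySpark sketch contrasting the two; the lookup data and record codes are made up.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast_vs_accumulator").getOrCreate()
sc = spark.sparkContext

# Broadcast: read-only lookup table shipped once to every executor.
country_lookup = sc.broadcast({"IN": "India", "US": "United States"})

# Accumulator: tasks only add to it; the driver reads the final value.
bad_records = sc.accumulator(0)

def resolve(code):
    if code not in country_lookup.value:
        bad_records.add(1)
        return "UNKNOWN"
    return country_lookup.value[code]

rdd = sc.parallelize(["IN", "US", "XX", "IN"])
print(rdd.map(resolve).collect())   # ['India', 'United States', 'UNKNOWN', 'India']
print(bad_records.value)            # 1, read back on the driver
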
Q21. 1. Different Types of integration runtime in adf 2. How to copy 100 files from one adls path to another in adf 3. Diff between DAG and Lineage , narrow and wide transformation in Spark 4. DBUtils questions. 5. ...

Ans.

The interview questions cover topics related to Azure Data Factory, Spark, and Python programming.

  • Integration runtimes in ADF include Azure, Self-hosted, and SSIS IRs.

  • To copy 100 files in ADF, use a Copy Data activity with a wildcard path in source and sink datasets.

  • DAG in Spark represents a directed acyclic graph of computation, while lineage tracks the data flow.

  • Narrow transformations in Spark operate on a single partition, wide transformations shuffle data across partition...

Q22. Two SQL queries and two Python snippets, e.g. reverse a string?

Ans.

Reverse a string using SQL and Python codes.

  • In SQL, use the REVERSE function to reverse a string.

  • In Python, use slicing with a step of -1 to reverse a string.

Q23. What is the SQL query to group by employee ID in order to combine the first name and last name with a space?

Ans.

SQL query to group by employee ID and combine first name and last name with a space

  • Use the GROUP BY clause to group by employee ID

  • Use the CONCAT function to combine first name and last name with a space

  • Select employee ID, CONCAT(first_name, ' ', last_name) AS full_name

Q24. How do you decide on cores and worker nodes?

Ans.

Cores and worker nodes are decided based on the workload requirements and scalability needs of the data processing system.

  • Consider the size and complexity of the data being processed

  • Evaluate the processing speed and memory requirements of the tasks

  • Take into account the parallelism and concurrency needed for efficient data processing

  • Monitor the system performance and adjust cores and worker nodes as needed

Q25. SQL query to find the 2nd most ordered item in a category

Ans.

Use a SQL query with a subquery to find the 2nd most ordered item in a category.

  • Use a subquery to rank items within each category based on the number of orders

  • Select the item with rank 2 within each category

  • Order the results by category and rank to get the 2nd most ordered item in each category

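A hedged sketch of the query, assuming a hypothetical order_items table (one row per order line, with category and item columns) registered in Spark; DENSE_RANK keeps ties at the same rank.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("second_most_ordered").getOrCreate()

second_most_ordered = spark.sql("""
    WITH item_counts AS (
        SELECT category, item, COUNT(*) AS order_count
        FROM order_items
        GROUP BY category, item
    ),
    ranked AS (
        SELECT *,
               DENSE_RANK() OVER (
                   PARTITION BY category
                   ORDER BY order_count DESC
               ) AS rnk
        FROM item_counts
    )
    SELECT category, item, order_count
    FROM ranked
    WHERE rnk = 2
""")
second_most_ordered.show()
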
Q26. Find top 5 countries with highest population in Spark and SQL

Ans.

Use Spark and SQL to find the top 5 countries with the highest population.

  • Use Spark to load the data and perform data processing.

  • Use SQL queries to group by country and sum the population.

  • Order the results in descending order and limit to top 5.

  • Example: SELECT country, SUM(population) AS total_population FROM table_name GROUP BY country ORDER BY total_population DESC LIMIT 5

Q27. What is the Catalyst optimizer? How does it work?

Ans.

A catalyst optimizer is a query optimization tool used in Apache Spark to improve performance by generating an optimal query plan.

  • Catalyst is a rule-based and cost-based query optimization framework in Apache Spark.

  • It leverages rules to transform the logical query plan into a more optimized physical plan.

  • The optimizer applies various optimization techniques like predicate pushdown, constant folding, and join reordering.

  • By optimizing the query plan, it reduces the overall execution...

Q28. What is shuffling? How to Handle Shuffling?

Ans.

Shuffling is the process of redistributing data across partitions in a distributed computing environment.

  • Shuffling is necessary when data needs to be grouped or aggregated across different partitions.

  • It can be handled efficiently by minimizing the amount of data being shuffled and optimizing the partitioning strategy.

  • Techniques like partitioning, combiners, and reducers can help reduce the amount of shuffling in MapReduce jobs.

Q29. What are the different joins in SQL? Please give an example to elaborate.

Ans.

Different types of joins in SQL include inner join, left join, right join, and full outer join.

  • Inner join: Returns rows when there is a match in both tables.

  • Left join: Returns all rows from the left table and the matched rows from the right table.

  • Right join: Returns all rows from the right table and the matched rows from the left table.

  • Full outer join: Returns all rows from both tables, with NULLs where there is no match on either side.

  • Example: SELECT * FROM table1 INNER JOIN table2 ON table1.id = table2.id;

Q30. Write a spark program to find the word with maximum length in a given string

Ans.

Use Spark program to find word with maximum length in a given string

  • Split the string into words using space as delimiter

  • Map each word to its length

  • Find the word with maximum length using reduce operation

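A minimal sketch of such a Spark program using the RDD API, along the lines described above.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("longest_word").getOrCreate()
sc = spark.sparkContext

sentence = "spark makes distributed computation straightforward"

longest = (
    sc.parallelize(sentence.split(" "))               # split on space
      .map(lambda w: (w, len(w)))                     # map each word to its length
      .reduce(lambda a, b: a if a[1] >= b[1] else b)  # keep the longer word
)
print(longest)   # ('straightforward', 15)
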
Q31. Write an SQL query to change rows into columns and vice versa

Ans.

Use SQL pivot function to change rows into columns and vice versa

  • Use the PIVOT function in SQL to transform rows into columns

  • Use the UNPIVOT function in SQL to transform columns into rows

  • Example: SELECT * FROM table_name PIVOT (SUM(value) FOR column_name IN (value1, value2, value3))

  • Example: SELECT * FROM table_name UNPIVOT (value FOR column_name IN (value1, value2, value3))

Q32. Write an SQL query to fetch duplicate rows in a table

Ans.

SQL query to fetch duplicate rows in a table

  • Use GROUP BY and HAVING clause to identify duplicate rows

  • Select columns to check for duplicates

  • Example: SELECT column1, column2, COUNT(*) FROM table_name GROUP BY column1, column2 HAVING COUNT(*) > 1;

Q33. What do you know about Spark architecture?

Ans.

Spark architecture is based on a master-slave architecture with a cluster manager to coordinate tasks.

  • Spark architecture consists of a driver program that communicates with a cluster manager to coordinate tasks.

  • The cluster manager allocates resources and schedules tasks on worker nodes.

  • Worker nodes execute the tasks and return results to the driver program.

  • Spark supports various cluster managers like YARN, Mesos, and standalone mode.

  • Spark applications can run in standalone mo...

Q34. Find the duplicate items in a list

Ans.

To find duplicate items in a list

  • Iterate through the list and compare each item with the rest of the list

  • Use a hash table to keep track of seen items

  • Sort the list and compare adjacent items

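A short Python sketch of the hash-set approach from the list above.

def find_duplicates(items):
    # One pass with a "seen" set: O(n) time, O(n) extra space.
    seen, duplicates = set(), set()
    for item in items:
        if item in seen:
            duplicates.add(item)
        else:
            seen.add(item)
    return list(duplicates)

print(find_duplicates(["a", "b", "a", "c", "b"]))   # ['a', 'b'] (order may vary)
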
Q35. How do you handle Incremental data?

Ans.

Incremental data is handled by identifying new data since the last update and merging it with existing data.

  • Identify new data since last update

  • Merge new data with existing data

  • Update data warehouse or database with incremental changes

Q36. Using two tables find the different records for different joins

Ans.

To find different records for different joins using two tables

  • Use the SQL query to perform different joins like INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL JOIN

  • Identify the key columns in both tables to join on

  • Select the columns from both tables and use WHERE clause to filter out the different records

Q37. What happens when we enforce schema?

Ans.

Enforcing schema ensures that data conforms to a predefined structure and rules.

  • Ensures data integrity by validating incoming data against predefined schema

  • Helps in maintaining consistency and accuracy of data

  • Prevents data corruption and errors in data processing

  • Can lead to rejection of data that does not adhere to the schema

Q38. What are the common file formats used in data storages? Which one is best for compression?

Ans.

Common file formats used in data storage include CSV, JSON, Parquet, Avro, and ORC. Columnar formats, Parquet in particular, give the best compression.

  • CSV (Comma-Separated Values) - simple and widely used, but not efficient for large datasets

  • JSON (JavaScript Object Notation) - human-readable and easy to parse, but can be inefficient for storage

  • Parquet - columnar storage format that is highly efficient for compression and query performance

  • Avro - efficient binary format with schema support, good for data serial...

Q39. End to End project Architecture and data pipeline working

Ans.

End to end project architecture involves designing and implementing a data pipeline to process and analyze data from various sources.

  • Define project requirements and goals

  • Design data architecture including data sources, storage, processing, and analytics tools

  • Implement data pipeline to extract, transform, and load data

  • Ensure data quality and consistency throughout the pipeline

  • Monitor and optimize performance of the data pipeline

  • Examples: Using Apache Kafka for real-time data s...

Q40. What are the conditions used in SQL? When we have a table but we want to create...

Ans.

SQL conditions are used to filter data based on specified criteria. Common conditions include WHERE, AND, OR, IN, BETWEEN, etc.

  • Common SQL conditions include WHERE, AND, OR, IN, BETWEEN, LIKE, etc.

  • Conditions are used to filter data based on specified criteria in SQL queries.

  • Examples: WHERE salary > 50000, AND department = 'IT', OR age < 30

Q41. Tell me about yourself; Projects done; What is Columnar format file in Spark; Internals of Spark, Difference between OLAP and OLTP; About Datawarehouse- facts, dimensions

Ans.

I am a Senior Data Engineer with experience in various projects involving columnar format files in Spark, understanding Spark internals, OLAP vs OLTP, and data warehousing concepts.

  • Projects: Developed ETL pipelines using Spark for processing large datasets, implemented data quality checks, and optimized query performance.

  • Columnar format file in Spark: It stores data in columnar format to improve query performance by reading only the required columns, like Parquet or ORC files...

Q42. What are some of the analytical functions available in SQL?

Ans.

Analytical functions in SQL are used to perform calculations on sets of rows.

  • Aggregate functions like SUM, AVG, COUNT, MIN, MAX (analytical when used with an OVER clause)

  • Window functions like ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD

  • Ranking functions like NTILE, PERCENT_RANK, CUME_DIST

  • Statistical functions like STDDEV, VARIANCE

  • Scalar string functions like CONCAT, SUBSTRING, TRIM are often used alongside them, though they are not analytical functions themselves

Q43. How to write a file in a delta table?

Ans.

To write a file in a delta table, you can use the Delta Lake API or Spark SQL commands.

  • Use Delta Lake API to write data to a delta table

  • Use Spark SQL commands like INSERT INTO to write data to a delta table

  • Ensure that the data being written is in the correct format and schema

Q44. What is the difference between a dynamic data frame and a Spark data frame in an AWS Glue job? Can we change a dynamic data frame to a Spark data frame?

Ans.

Dynamic data frame in AWS Glue job is a dynamically generated data frame, while Spark data frame is specifically created using Spark APIs.

  • Dynamic data frame is generated dynamically at runtime based on the data source and schema, while Spark data frame is explicitly created using Spark APIs.

  • Dynamic data frame is more flexible but may have performance implications compared to Spark data frame.

  • You can convert a dynamic data frame to a Spark data frame by explicitly defining the...

Q45. Do you have experience in Dataflow, Dataproc, cloud composer?

Ans.

Yes, I have experience in Dataflow, Dataproc, and cloud composer.

  • I have worked with Dataflow to process and analyze large datasets in real-time.

  • I have used Dataproc to create and manage Apache Spark and Hadoop clusters for big data processing.

  • I have experience with cloud composer for orchestrating workflows and managing data pipelines.

Q46. How can you optimize your queries for efficiency in BQ?

Ans.

Optimizing queries in BigQuery involves using partitioned tables, clustering, and optimizing joins.

  • Partition tables by date or another relevant column to reduce the amount of data scanned

  • Use clustering to group related rows together, reducing the amount of data scanned for queries

  • Avoid unnecessary joins and denormalize data where possible to reduce query complexity

Q47. How are primary keys and foreign keys implemented in Delta tables?

Ans.

Primary keys and foreign keys can be implemented in delta tables using constraints and references.

  • Primary keys can be implemented using the PRIMARY KEY constraint, which ensures that each record in the table has a unique identifier.

  • Foreign keys can be implemented using the FOREIGN KEY constraint, which establishes a link between two tables based on a common column.

  • The referenced table must have a primary key defined, and the foreign key column in the referencing table must ha...

Q48. After performing joins how many records would be retrieved for inner, left, right and outer joins

Ans.

The number of records retrieved after performing joins depends on the type of join - inner, left, right, or outer.

  • Inner join retrieves only the matching records from both tables

  • Left join retrieves all records from the left table and matching records from the right table

  • Right join retrieves all records from the right table and matching records from the left table

  • Outer join retrieves all records from both tables, filling in NULL values for non-matching records

Q49. Write a SQL query to select data from table 2 where data exists in table 1

Ans.

Use a SQL query to select data from table 2 where data exists in table 1

  • Use a JOIN statement to link the two tables based on a common column

  • Specify the columns you want to select from table 2

  • Use a WHERE clause to check for existence of data in table 1

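A hedged sketch using an EXISTS subquery; the tables table1 and table2 and their shared id column are assumptions, since the schema is not given.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("exists_check").getOrCreate()

matching_rows = spark.sql("""
    SELECT t2.*
    FROM table2 t2
    WHERE EXISTS (
        SELECT 1
        FROM table1 t1
        WHERE t1.id = t2.id
    )
""")
matching_rows.show()

# An INNER JOIN returns the same rows when id is unique in table1:
# SELECT t2.* FROM table2 t2 JOIN table1 t1 ON t1.id = t2.id
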
Q50. How do you handle missing data in a PySpark DataFrame?

Ans.

Handle missing data in pyspark dataframe by using functions like dropna, fillna, or replace.

  • Use dropna() function to remove rows with missing data

  • Use fillna() function to fill missing values with a specified value

  • Use replace() function to replace missing values with a specified value

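A small PySpark sketch of the three functions mentioned above, on made-up data.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("missing_data_demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "alice", None), (2, None, 40.0), (3, "carol", 25.5)],
    "id INT, name STRING, score DOUBLE",
)

df.dropna().show()                                        # drop rows with any null
df.dropna(subset=["name"]).show()                         # drop only when name is null
df.fillna({"name": "unknown", "score": 0.0}).show()       # fill nulls per column
df.replace("carol", "caroline", subset=["name"]).show()   # replace a specific value
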