Senior Data Engineer
200+ Senior Data Engineer Interview Questions and Answers
Q1. Write a query to get the customer with the highest total order value for each year, month. [Note: Order table is different and Customer table is different. Order_ID and Customer_ID are the PK of the table with...
Query to get the customer with the highest total order value for each year and month.
Join the Order and Customer tables on the foreign key
Group the results by year, month, and customer
Calculate the total order value for each group
Find the maximum total order value for each year, month
If there are multiple customers with the same highest total order value, select the one with the lower Customer_ID
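One possible implementation, shown here as a Spark SQL sketch; the table and column names (Orders, Customers, Order_Date, Order_Value) are assumptions, since only the keys are described in the question.

    # Sketch: top customer per year/month by total order value, ties broken by lower Customer_ID.
    # Table and column names are assumptions for illustration.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    result = spark.sql("""
        WITH monthly AS (
            SELECT YEAR(o.Order_Date)  AS order_year,
                   MONTH(o.Order_Date) AS order_month,
                   c.Customer_ID,
                   SUM(o.Order_Value)  AS total_value
            FROM Orders o
            JOIN Customers c ON o.Customer_ID = c.Customer_ID
            GROUP BY YEAR(o.Order_Date), MONTH(o.Order_Date), c.Customer_ID
        )
        SELECT order_year, order_month, Customer_ID, total_value
        FROM (
            SELECT m.*,
                   ROW_NUMBER() OVER (
                       PARTITION BY order_year, order_month
                       ORDER BY total_value DESC, Customer_ID ASC
                   ) AS rn
            FROM monthly m
        ) ranked
        WHERE rn = 1
    """)
    result.show()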
Q2. There are 10 million records in the table and the schema does not contain the ModifiedDate column. One cell was modified the next day in the table. How will you fetch that particular information that needs to b...
To fetch the modified information from a 10-million-row table that has no ModifiedDate column:
Create a trigger to capture the modified information and insert it into a separate table with ModifiedDate column.
Use a tool like Change Data Capture (CDC) to track changes in the table and extract the modified information.
Use a query to compare the current table with a backup copy taken the previous day to identify the modified information.
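A minimal sketch of the backup-comparison approach, assuming a hypothetical snapshot table orders_backup taken the previous day with the same schema as orders:

    # Sketch: rows in today's table that differ from (or are missing in) yesterday's copy.
    # Table names orders / orders_backup are assumptions for illustration.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    changed_rows = spark.sql("""
        SELECT * FROM orders
        EXCEPT
        SELECT * FROM orders_backup
    """)
    changed_rows.show()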
Senior Data Engineer Interview Questions and Answers for Freshers
Q3. How do you handle data pipeline when the schema information keeps changing at the source?
Handle changing schema by using schema evolution techniques and version control.
Use schema evolution techniques like adding new fields, renaming fields, and changing data types.
Implement version control to track changes and ensure backward compatibility.
Use tools like Apache Avro or Apache Parquet to store data in a self-describing format.
Implement automated testing to ensure data quality and consistency.
Collaborate with data producers to establish clear communication and documentation of schema changes.
Q4. Difference between Parquet and ORC file. Why industry uses parquet over ORC? Can schema evolution happen in ORC?
Parquet and ORC are columnar storage formats. Parquet is preferred due to its cross-platform support and better compression. ORC supports schema evolution.
Parquet is a columnar storage format that is widely used in the industry due to its cross-platform support and better compression.
ORC is another columnar storage format that supports schema evolution.
Parquet is preferred over ORC due to its better compression and support for a wider range of programming languages.
ORC is preferred in Hive-centric environments and is the required format for Hive ACID transactional tables.
Q5. What is Normalisation and Denormalisation? When do we use them? Give a real-time example that is implemented in your project.
Normalisation is the process of organizing data in a database to reduce redundancy and improve data integrity. Denormalisation is the opposite process.
Normalisation is used to eliminate data redundancy and improve data integrity.
Denormalisation is used to improve query performance by reducing the number of joins required.
A real-time example of normalisation is breaking down a customer's information into separate tables such as customer details, order details, and payment details.
Q6. Given infinite coins of some currency with denominations 1, 2, 5, 10, in how many unique distinct ways can we obtain a total amount of, say, 25? How would the same be done for non-unique repeated combinations or p...
There are multiple ways to obtain a total amount of 25 using coins of denominations 1, 2, 5, and 10. The question asks for unique distinct ways and for non-unique repeated combinations or permutations.
For unique distinct ways, you can use dynamic programming to calculate the number of ways to reach the target amount.
For non-unique repeated combinations or permutations, you can use a recursive function to generate all possible combinations.
Example for unique distinct ways: [1, 2, 2, 10, 10] and [5, 10, 10] are two distinct combinations that sum to 25.
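A minimal Python sketch of both counts for denominations 1, 2, 5, 10 and a target of 25, using the standard dynamic-programming formulation:

    # Sketch: count the ways to reach a target amount from unlimited coins.
    def count_combinations(coins, target):
        # Order does not matter: coins form the outer loop.
        ways = [0] * (target + 1)
        ways[0] = 1
        for coin in coins:
            for amount in range(coin, target + 1):
                ways[amount] += ways[amount - coin]
        return ways[target]

    def count_permutations(coins, target):
        # Order matters: amounts form the outer loop.
        ways = [0] * (target + 1)
        ways[0] = 1
        for amount in range(1, target + 1):
            for coin in coins:
                if coin <= amount:
                    ways[amount] += ways[amount - coin]
        return ways[target]

    print(count_combinations([1, 2, 5, 10], 25))   # distinct combinations
    print(count_permutations([1, 2, 5, 10], 25))   # ordered sequences (permutations)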
Q7. What are the different types of schema you know in Data Warehousing?
There are three types of schema in Data Warehousing: Star Schema, Snowflake Schema, and Fact Constellation Schema.
Star Schema: central fact table connected to dimension tables in a star shape
Snowflake Schema: extension of star schema with normalized dimension tables
Fact Constellation Schema: multiple fact tables connected to dimension tables in a complex structure
Q8. Given a non-decreasing array, how can I determine the indices of an element X within it? If the element is not present, the output should be [-1, -1]. For example, for the array [1,2,3,3,5,5,7,8] and X=5, the e...
Find the indices of an element in a non-decreasing array.
A linear scan works, but since the array is non-decreasing, two binary searches for the first and last occurrence of X run in O(log n)
Return [first_index, last_index], or [-1, -1] if element X is not found
Handle edge cases like empty array or X not present in the array
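A minimal Python sketch using two binary searches (the standard bisect module) to return [first_index, last_index] in O(log n):

    from bisect import bisect_left, bisect_right

    def search_range(nums, x):
        left = bisect_left(nums, x)          # first position where x could sit
        right = bisect_right(nums, x) - 1    # last position where x actually occurs
        return [left, right] if left <= right else [-1, -1]

    print(search_range([1, 2, 3, 3, 5, 5, 7, 8], 5))   # [4, 5]
    print(search_range([1, 2, 3, 3, 5, 5, 7, 8], 4))   # [-1, -1]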
Q9. How many stages will be created if a spark job has 3 wide transformations and 2 narrow transformations?
There will be 4 stages created in total for the spark job.
Wide transformations trigger a shuffle and create a new stage.
Narrow transformations do not trigger a shuffle and do not create a new stage.
In this case, 3 wide transformations will create 3 new stages and 2 narrow transformations will not create new stages.
Therefore, a total of 4 stages will be created.
Q10. What is the best approach to finding whether the data frame is empty or not?
For a pandas DataFrame, check the .empty attribute or len(df); for a Spark DataFrame, check for at least one row instead of counting everything.
len(df) gives the number of rows in a pandas DataFrame; if it is 0 (or df.empty is True), the data frame is empty.
For a Spark DataFrame, use df.head(1) or df.isEmpty() (Spark 3.3+) rather than df.count(), which scans the entire dataset.
Example: if len(df) == 0: print('Data frame is empty')
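A small sketch of both checks, for a pandas DataFrame and for a Spark DataFrame:

    # Sketch: emptiness checks without scanning the full dataset.
    import pandas as pd
    from pyspark.sql import SparkSession

    pdf = pd.DataFrame()
    print(pdf.empty)                 # True when the pandas DataFrame has no rows

    spark = SparkSession.builder.getOrCreate()
    sdf = spark.range(0)             # an empty Spark DataFrame for illustration
    print(len(sdf.head(1)) == 0)     # True; cheaper than a full count()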
Q11. How do you calculate the resources based on the number of cores and memory given, e.g. 16 cores with 64 GB? What is overhead, driver memory, etc.?
Resource allocation is worked out from the cores and memory available per node, after setting aside overhead and driver memory.
Start from the total cores and memory per worker node (here, 16 cores and 64 GB).
Deduct the cores and memory needed by the operating system and other services such as Hadoop daemons (typically about 1 core and 1 GB).
Reserve driver memory separately from executor memory; the executors share what remains.
Account for the per-executor overhead Spark adds on top of executor memory (spark.executor.memoryOverhead, by default the larger of 384 MB or roughly 10% of executor memory).
Example: for 16 cores with 64 GB, leaving 1 core and 1 GB for the OS and using about 5 cores per executor gives 3 executors of roughly 19 GB each, as in the sketch below.
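A rough worked example for a single 16-core / 64 GB worker node; the 1-core / 1 GB OS reservation and the 5-cores-per-executor rule of thumb are assumptions, not fixed formulas:

    # Rough sizing sketch for one 16-core / 64 GB node (rule-of-thumb numbers).
    node_cores, node_mem_gb = 16, 64
    os_cores, os_mem_gb = 1, 1                                  # reserved for the OS and daemons

    usable_cores = node_cores - os_cores                        # 15
    usable_mem_gb = node_mem_gb - os_mem_gb                     # 63

    cores_per_executor = 5                                      # common rule of thumb
    executors_per_node = usable_cores // cores_per_executor     # 3

    mem_per_executor_gb = usable_mem_gb / executors_per_node    # 21 GB slice per executor
    overhead_gb = max(0.384, 0.10 * mem_per_executor_gb)        # spark.executor.memoryOverhead default
    executor_memory_gb = mem_per_executor_gb - overhead_gb      # ~19 GB for --executor-memory

    print(executors_per_node, round(executor_memory_gb, 1))     # 3 18.9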
Q12. Big data: Hadoop architecture and HDFS commands to copy and list files in HDFS; Spark architecture, transformations and actions; what happens when we submit a Spark program; Spark DataFrame coding question; s...
Questions on big data, Hadoop, Spark, Scala, Git, the project, and Agile.
Hadoop architecture and HDFS commands for copying and listing files in HDFS
Spark architecture and Transformation and Action question
What happens when we submit a Spark program
Spark DataFrame coding question
Scala basic program on List
Git and Github
Project-related question
Agile-related questions
Q13. Explain Transaction Isolation, and what are the various types of Transaction isolation in RDBMS ?
Transaction isolation is a concept in databases that ensures transactions are executed independently of each other.
Transaction isolation levels determine the degree to which one transaction must be isolated from other transactions.
Types of transaction isolation levels include Read Uncommitted, Read Committed, Repeatable Read, and Serializable.
Each isolation level offers a different level of consistency and concurrency control.
For example, in the Read Uncommitted isolation level, a transaction can see uncommitted changes made by other transactions (dirty reads).
Q14. How would you pass a connection string if a Lambda function is connecting to a database?
Pass connection string as environment variable or use AWS Secrets Manager
Store connection string as environment variable in Lambda function configuration
Retrieve connection string from AWS Secrets Manager and use it in Lambda function
Use IAM role to grant Lambda function access to database
Encrypt connection string using AWS KMS for added security
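A minimal Lambda handler sketch; the environment variable name DB_CONN and the secret name prod/db-conn are hypothetical:

    # Sketch: read a database connection string inside an AWS Lambda handler.
    import os
    import boto3

    def lambda_handler(event, context):
        # Option 1: plain environment variable set in the Lambda configuration.
        conn = os.environ.get("DB_CONN")

        # Option 2: fetch the value from AWS Secrets Manager at runtime.
        if conn is None:
            client = boto3.client("secretsmanager")
            conn = client.get_secret_value(SecretId="prod/db-conn")["SecretString"]

        # ... open the database connection with `conn` here ...
        return {"statusCode": 200}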
Q15. What is the difference between tasks and stages? What about the Spark UI?
Tasks and stages are components of the execution plan in Spark UI.
Tasks are the smallest unit of work in Spark, representing a single operation on a partition of data.
Stages are groups of tasks that are executed together as part of a larger computation.
Tasks within a stage can be executed in parallel, while stages are executed sequentially.
Tasks are created based on the transformations and actions in the Spark application.
Stages are created based on the dependencies between RDDs, with a new stage starting at each shuffle (wide) dependency.
Q16. How do you migrate 1000s of tables using Spark (Databricks) notebooks?
Use Spark (Databricks) notebooks to migrate 1000s of tables efficiently.
Utilize Spark's parallel processing capabilities to handle large volumes of data
Leverage Databricks notebooks for interactive data exploration and transformation
Automate the migration process using scripts or workflows
Optimize performance by tuning Spark configurations and cluster settings
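A simplified Databricks-style sketch; the paths, formats, and table list are assumptions, and a real migration would add logging, retries, and schema validation:

    # Sketch: copy many tables by looping over a metadata list and writing each with Spark.
    from concurrent.futures import ThreadPoolExecutor
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    tables = ["sales", "customers", "orders"]        # in practice, read from a config/metadata table

    def migrate(table_name):
        df = spark.read.format("parquet").load(f"/mnt/source/{table_name}")
        df.write.format("delta").mode("overwrite").save(f"/mnt/target/{table_name}")
        return table_name

    with ThreadPoolExecutor(max_workers=4) as pool:  # submit several copies concurrently
        for done in pool.map(migrate, tables):
            print(f"migrated {done}")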
Q17. What is the difference between repartition and coalesce?
Repartition increases or decreases the number of partitions in a DataFrame, while Coalesce only decreases the number of partitions.
Repartition can increase or decrease the number of partitions in a DataFrame, leading to a shuffle of data across the cluster.
Coalesce only decreases the number of partitions in a DataFrame without performing a full shuffle, making it more efficient than repartition.
Repartition is typically used when there is a need to increase the number of partitions or to rebalance skewed data, while coalesce is preferred when simply reducing partitions, for example before writing output.
Q18. Python coding question, without built-in Python methods: 1. check whether a list is sorted; 2. sort the list and optimize the solution.
Check if a list is sorted and sort the list without using Python methods.
To check if a list is sorted, iterate through the list and compare each element with the next one. If any element is greater than the next one, the list is not sorted.
To sort the list without using Python methods, implement a sorting algorithm like bubble sort, selection sort, or insertion sort.
Example for checking if a list is sorted: ['a', 'b', 'c'] is sorted, ['c', 'b', 'a'] is not sorted.
Example for sorting the list: [3, 1, 2] becomes [1, 2, 3].
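A minimal Python sketch of both parts, using a manual comparison loop and an insertion sort (which is O(n) on an already-sorted list):

    def is_sorted(items):
        for i in range(len(items) - 1):
            if items[i] > items[i + 1]:
                return False
        return True

    def insertion_sort(items):
        # O(n^2) in the worst case, O(n) when the input is already sorted.
        for i in range(1, len(items)):
            current = items[i]
            j = i - 1
            while j >= 0 and items[j] > current:
                items[j + 1] = items[j]
                j -= 1
            items[j + 1] = current
        return items

    print(is_sorted([1, 2, 3]))        # True
    print(insertion_sort([3, 1, 2]))   # [1, 2, 3]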
Q19. What are optimisation techniques used in the project?
Optimisation techniques used in the project include indexing, query optimization, caching, and parallel processing.
Indexing: Creating indexes on frequently queried columns to improve search performance.
Query optimization: Rewriting queries to make them more efficient and reduce execution time.
Caching: Storing frequently accessed data in memory to reduce the need for repeated database queries.
Parallel processing: Distributing tasks across multiple processors to speed up data processing.
Q20. What are the optimization techniques used in your project?
Optimization techniques used in project
Caching
Parallel processing
Compression
Indexing
Query optimization
Q21. Which AWS services used and AWS architecture for those services
AWS services used include S3, Redshift, Glue, EMR, and Lambda in a scalable and cost-effective architecture.
AWS S3 for storing large amounts of data
AWS Redshift for data warehousing and analytics
AWS Glue for ETL processes
AWS EMR for big data processing
AWS Lambda for serverless computing
Q22. How does query acceleration speed up query processing?
Query acceleration speeds up query processing by optimizing query execution and reducing the time taken to retrieve data.
Query acceleration uses techniques like indexing, partitioning, and caching to optimize query execution.
It reduces the time taken to retrieve data by minimizing disk I/O and utilizing in-memory processing.
Examples include using columnar storage formats like Parquet or optimizing join operations.
Q23. Difference between Broadcast variable and accumulator variable
Broadcast variables are read-only values cached on each worker node, while accumulators are shared variables that tasks can only add to and whose total is read back on the driver.
Broadcast variables are used to give every node a copy of a large input dataset or a small lookup table.
Accumulator variables are used to keep a running total of values across multiple tasks.
Broadcast variables are used for read-only lookups inside tasks, while accumulator variables are used for write-style aggregation such as counters and error tallies, read on the driver after the job runs.
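A small PySpark sketch of both variable types; the lookup data and input rows are made up for illustration:

    # Sketch: a broadcast lookup table and an accumulator counter.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    country_lookup = sc.broadcast({"IN": "India", "US": "United States"})  # read-only copy per executor
    bad_rows = sc.accumulator(0)                                           # tasks add, driver reads

    def resolve(code):
        if code not in country_lookup.value:
            bad_rows.add(1)
            return "Unknown"
        return country_lookup.value[code]

    names = sc.parallelize(["IN", "US", "XX"]).map(resolve).collect()
    print(names, bad_rows.value)   # ['India', 'United States', 'Unknown'] 1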
Q24. 1. Different Types of integration runtime in adf 2. How to copy 100 files from one adls path to another in adf 3. Diff between DAG and Lineage , narrow and wide transformation in Spark 4. DBUtils questions. 5....
The interview questions cover topics related to Azure Data Factory, Spark, and Python programming.
Integration runtimes in ADF include Azure, Self-hosted, and SSIS IRs.
To copy 100 files in ADF, use a Copy Data activity with a wildcard path in source and sink datasets.
DAG in Spark represents a directed acyclic graph of computation, while lineage tracks the data flow.
Narrow transformations in Spark operate on a single partition, while wide transformations shuffle data across partitions.
Q25. Given a sorted array of integers, if each element is squared, provide the sorted array of the squared elements in the least possible time complexity.
Given a sorted array of integers, return the sorted array of squared elements in the least possible time complexity.
Create a result array of the same length to hold the squared elements.
Use two pointers starting at both ends of the original array, where the largest absolute values sit.
Compare the squares at the two pointers and write the larger one into the result from the back, so the result ends up ascending in O(n) (see the sketch below).
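A minimal Python sketch of the two-pointer approach:

    # Sketch: sorted squares of a sorted array in O(n).
    def sorted_squares(nums):
        result = [0] * len(nums)
        left, right = 0, len(nums) - 1
        for pos in range(len(nums) - 1, -1, -1):        # fill from the largest square down
            if abs(nums[left]) > abs(nums[right]):
                result[pos] = nums[left] ** 2
                left += 1
            else:
                result[pos] = nums[right] ** 2
                right -= 1
        return result

    print(sorted_squares([-4, -1, 0, 3, 5]))   # [0, 1, 9, 16, 25]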
Q26. Two SQL coding questions and two Python coding questions, such as reversing a string?
Reverse a string using SQL and Python codes.
In SQL, use the REVERSE function to reverse a string.
In Python, use slicing with a step of -1 to reverse a string.
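A short Python sketch; the SQL version is a one-liner with REVERSE():

    text = "data engineer"
    print(text[::-1])                  # slicing with step -1
    print("".join(reversed(text)))     # reversed() plus join, same result
    # SQL equivalent (e.g. SQL Server / MySQL):  SELECT REVERSE('data engineer');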
Q27. What is the SQL query to group by employee ID in order to combine the first name and last name with a space?
SQL query to group by employee ID and combine first name and last name with a space
Use the GROUP BY clause to group by employee ID
Use the CONCAT function to combine first name and last name with a space
Select employee ID, CONCAT(first_name, ' ', last_name) AS full_name
Q28. How do you decide on cores and worker nodes?
Cores and worker nodes are decided based on the workload requirements and scalability needs of the data processing system.
Consider the size and complexity of the data being processed
Evaluate the processing speed and memory requirements of the tasks
Take into account the parallelism and concurrency needed for efficient data processing
Monitor the system performance and adjust cores and worker nodes as needed
Q29. SQL query to find the 2nd most order item in a category
Use a SQL query with a subquery to find the 2nd most ordered item in a category.
Use a subquery to rank items within each category based on the number of orders
Select the item with rank 2 within each category
Order the results by category and rank to get the 2nd most ordered item in each category
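A hedged Spark SQL sketch, assuming a hypothetical orders table with one row per ordered item and columns category and item:

    # Sketch: 2nd most ordered item per category using DENSE_RANK over order counts.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    second_most = spark.sql("""
        SELECT category, item, order_count
        FROM (
            SELECT category,
                   item,
                   COUNT(*) AS order_count,
                   DENSE_RANK() OVER (PARTITION BY category ORDER BY COUNT(*) DESC) AS rnk
            FROM orders
            GROUP BY category, item
        ) ranked
        WHERE rnk = 2
    """)
    second_most.show()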
Q30. Find top 5 countries with highest population in Spark and SQL
Use Spark and SQL to find the top 5 countries with the highest population.
Use Spark to load the data and perform data processing.
Use SQL queries to group by country and sum the population.
Order the results in descending order and limit to top 5.
Example: SELECT country, SUM(population) AS total_population FROM table_name GROUP BY country ORDER BY total_population DESC LIMIT 5
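For the Spark half of the question, the same logic with the DataFrame API; the table and column names are assumptions:

    # Sketch: top 5 countries by population with the PySpark DataFrame API.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.table("table_name")                   # assumed source table

    top5 = (df.groupBy("country")
              .agg(F.sum("population").alias("total_population"))
              .orderBy(F.desc("total_population"))
              .limit(5))
    top5.show()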
Q31. What is a catalyst optimiser? How does it work?
A catalyst optimizer is a query optimization tool used in Apache Spark to improve performance by generating an optimal query plan.
Catalyst optimizer is a rule-based query optimization framework in Apache Spark.
It leverages rules to transform the logical query plan into a more optimized physical plan.
The optimizer applies various optimization techniques like predicate pushdown, constant folding, and join reordering.
By optimizing the query plan, it reduces the overall execution time of the query.
Q32. What is shuffling? How to Handle Shuffling?
Shuffling is the process of redistributing data across partitions in a distributed computing environment.
Shuffling is necessary when data needs to be grouped or aggregated across different partitions.
It can be handled efficiently by minimizing the amount of data being shuffled and optimizing the partitioning strategy.
Techniques like partitioning, combiners, and reducers can help reduce the amount of shuffling in MapReduce jobs.
Q33. What are the different joins in SQL? Please give an example to elaborate.
Different types of joins in SQL include inner join, left join, right join, and full outer join.
Inner join: Returns rows when there is a match in both tables.
Left join: Returns all rows from the left table and the matched rows from the right table.
Right join: Returns all rows from the right table and the matched rows from the left table.
Full outer join: Returns all rows from both tables, matched where possible and filled with NULLs where there is no match.
Example: SELECT * FROM table1 INNER JOIN table2 ON table1.id = table2.id;
Q34. How would you design a review analytics dashboard for a client partner?
I would design a review analytics dashboard by integrating data from various sources, creating visualizations for key metrics, and allowing for customization and interactivity.
Integrate data from various sources such as customer reviews, ratings, and feedback.
Create visualizations for key metrics like average rating, sentiment analysis, and review volume over time.
Allow for customization by letting users filter and drill down into specific data points.
Enable interactivity so users can click through from summary metrics to the underlying reviews.
Q35. Write a spark program to find the word with maximum length in a given string
Use Spark program to find word with maximum length in a given string
Split the string into words using space as delimiter
Map each word to its length
Find the word with maximum length using reduce operation
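A minimal PySpark sketch of this approach:

    # Sketch: longest word in a string via split + reduce on an RDD.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    sentence = "spark makes distributed processing simple"
    longest = (sc.parallelize(sentence.split(" "))
                 .reduce(lambda a, b: a if len(a) >= len(b) else b))
    print(longest)   # distributed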
Q36. Write an SQL query to change rows into columns and vice versa.
Use SQL pivot function to change rows into columns and vice versa
Use the PIVOT function in SQL to transform rows into columns
Use the UNPIVOT function in SQL to transform columns into rows
Example: SELECT * FROM table_name PIVOT (SUM(value) FOR column_name IN (value1, value2, value3))
Example: SELECT * FROM table_name UNPIVOT (value FOR column_name IN (value1, value2, value3))
Q37. Write an SQL query to fetch duplicate rows in a table.
SQL query to fetch duplicate rows in a table
Use GROUP BY and HAVING clause to identify duplicate rows
Select columns to check for duplicates
Example: SELECT column1, column2, COUNT(*) FROM table_name GROUP BY column1, column2 HAVING COUNT(*) > 1;
Q38. Find the duplicate items in a list.
To find duplicate items in a list
Iterate through the list and compare each item with the rest of the list
Use a hash table to keep track of seen items
Sort the list and compare adjacent items
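A short Python sketch using a seen-set (one pass, O(n)):

    def find_duplicates(items):
        seen, dupes = set(), set()
        for item in items:
            if item in seen:
                dupes.add(item)
            else:
                seen.add(item)
        return list(dupes)

    print(find_duplicates(["a", "b", "a", "c", "b"]))   # ['a', 'b'] (order may vary)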
Q39. What do you know about Spark architecture?
Spark architecture is based on a master-slave architecture with a cluster manager to coordinate tasks.
Spark architecture consists of a driver program that communicates with a cluster manager to coordinate tasks.
The cluster manager allocates resources and schedules tasks on worker nodes.
Worker nodes execute the tasks and return results to the driver program.
Spark supports various cluster managers like YARN, Mesos, and standalone mode.
Spark applications can run in standalone mode or on cluster managers such as YARN, Mesos, or Kubernetes.
Q40. What is the purpose of the spark-submit command in Apache Spark?
Spark Submit command is used to submit Spark applications to a cluster.
Spark Submit command is used to launch applications on a Spark cluster.
It allows users to specify application parameters like main class, jars, and arguments.
Users can also configure properties like memory allocation and number of executors.
Example: spark-submit --class com.example.Main --master yarn --deploy-mode cluster myApp.jar
Q41. How do you handle Incremental data?
Incremental data is handled by identifying new data since the last update and merging it with existing data.
Identify new data since last update
Merge new data with existing data
Update data warehouse or database with incremental changes
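One common implementation of the merge step is an upsert; a minimal sketch using Delta Lake's MERGE syntax, with hypothetical target and staging_increment tables keyed on id:

    # Sketch: upsert an incremental batch into the target table (Delta Lake MERGE).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    spark.sql("""
        MERGE INTO target t
        USING staging_increment s
        ON t.id = s.id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)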
Q42. Using two tables, find the different records returned by different joins.
To find different records for different joins using two tables
Use the SQL query to perform different joins like INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL JOIN
Identify the key columns in both tables to join on
Select the columns from both tables and use a WHERE clause (for example, checking for NULL join keys) to isolate records that exist in one table but not the other
Q43. What happens when we enforce schema ?
Enforcing schema ensures that data conforms to a predefined structure and rules.
Ensures data integrity by validating incoming data against predefined schema
Helps in maintaining consistency and accuracy of data
Prevents data corruption and errors in data processing
Can lead to rejection of data that does not adhere to the schema
Q44. What are the common file formats used in data storages? Which one is best for compression?
Common file formats used in data storages include CSV, JSON, Parquet, Avro, and ORC. Parquet is best for compression.
CSV (Comma-Separated Values) - simple and widely used, but not efficient for large datasets
JSON (JavaScript Object Notation) - human-readable and easy to parse, but can be inefficient for storage
Parquet - columnar storage format that is highly efficient for compression and query performance
Avro - efficient binary format with schema support, good for data serialization and schema evolution
Q45. End to End project Architecture and data pipeline working
End to end project architecture involves designing and implementing a data pipeline to process and analyze data from various sources.
Define project requirements and goals
Design data architecture including data sources, storage, processing, and analytics tools
Implement data pipeline to extract, transform, and load data
Ensure data quality and consistency throughout the pipeline
Monitor and optimize performance of the data pipeline
Examples: Using Apache Kafka for real-time data streaming into the pipeline.
Q46. What are the conditions used in SQL to filter data from a table?
SQL conditions are used to filter data based on specified criteria. Common conditions include WHERE, AND, OR, IN, BETWEEN, etc.
Common SQL conditions include WHERE, AND, OR, IN, BETWEEN, LIKE, etc.
Conditions are used to filter data based on specified criteria in SQL queries.
Examples: WHERE salary > 50000, AND department = 'IT', OR age < 30
Q47. Tell me about yourself; Projects done; What is Columnar format file in Spark; Internals of Spark, Difference between OLAP and OLTP; About Datawarehouse- facts, dimensions
I am a Senior Data Engineer with experience in various projects involving columnar format files in Spark, understanding Spark internals, OLAP vs OLTP, and data warehousing concepts.
Projects: Developed ETL pipelines using Spark for processing large datasets, implemented data quality checks, and optimized query performance.
Columnar format file in Spark: It stores data in columnar format to improve query performance by reading only the required columns, like Parquet or ORC files.
Q48. What are some of the analytical functions available in SQL?
Analytical functions in SQL are used to perform calculations on sets of rows.
Aggregate functions like SUM, AVG, COUNT, MIN, MAX
Window functions like ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD
Ranking functions like NTILE, PERCENT_RANK, CUME_DIST
Statistical functions like STDDEV, VARIANCE
String functions like CONCAT, SUBSTRING, TRIM
Q49. How to write a file in a delta table?
To write a file in a delta table, you can use the Delta Lake API or Spark SQL commands.
Use Delta Lake API to write data to a delta table
Use Spark SQL commands like INSERT INTO to write data to a delta table
Ensure that the data being written is in the correct format and schema
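A minimal PySpark sketch; the table name and sample rows are assumptions, and the Delta format requires Databricks or a configured delta-spark package:

    # Sketch: write a DataFrame out as a Delta table, then append via Spark SQL.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "name"])

    df.write.format("delta").mode("overwrite").saveAsTable("events")   # managed Delta table
    spark.sql("INSERT INTO events VALUES (3, 'gamma')")                # SQL alternative for appends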
Q50. What is the difference between a dynamic data frame (DynamicFrame) and a Spark data frame in an AWS Glue job? Can we change a dynamic data frame to a Spark data frame?
A DynamicFrame in an AWS Glue job is Glue's schema-flexible abstraction, while a Spark DataFrame is created through the standard Spark APIs with a fixed schema.
Each DynamicFrame record is self-describing, so it tolerates messy or evolving schemas; a Spark DataFrame requires the schema to be resolved up front.
A DynamicFrame is more flexible but may have performance implications compared to a Spark DataFrame.
You can convert a DynamicFrame to a Spark DataFrame with its toDF() method, and convert back with DynamicFrame.fromDF().