Senior Data Engineer
200+ Senior Data Engineer Interview Questions and Answers
Q1. Write a query to get the customer with the highest total order value for each year, month. [Note: Order table is different and Customer table is different. Order_ID and Customer_ID are the PK of the table with ...
Query to get the customer with the highest total order value for each year and month.
Join the Order and Customer tables on the foreign key
Group the results by year, month, and customer
Calculate the total order value for each group
Find the maximum total order value for each year, month
If there are multiple customers with the same highest total order value, select the one with the lower Customer_ID
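A minimal Spark SQL sketch of this approach, assuming hypothetical views orders(order_id, customer_id, order_date, order_value) and customers(customer_id, customer_name) have been registered:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("top-customer-per-month").getOrCreate()

top_customer_sql = """
WITH monthly_totals AS (
    SELECT YEAR(o.order_date)  AS order_year,
           MONTH(o.order_date) AS order_month,
           o.customer_id,
           SUM(o.order_value)  AS total_order_value
    FROM orders o
    GROUP BY YEAR(o.order_date), MONTH(o.order_date), o.customer_id
), ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY order_year, order_month
               ORDER BY total_order_value DESC, customer_id ASC  -- tie-break on lower customer_id
           ) AS rn
    FROM monthly_totals
)
SELECT r.order_year, r.order_month, r.customer_id, c.customer_name, r.total_order_value
FROM ranked r
JOIN customers c ON c.customer_id = r.customer_id
WHERE r.rn = 1
"""

spark.sql(top_customer_sql).show()

The ROW_NUMBER window implements both the per-month maximum and the lower-Customer_ID tie-break in one pass.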
Q2. There are 10 million records in the table and the schema does not contain the ModifiedDate column. One cell was modified the next day in the table. How will you fetch that particular information that needs to b...
To fetch the modified information from a table with 10 million records that has no ModifiedDate column:
Create a trigger to capture the modified information and insert it into a separate table with ModifiedDate column.
Use a tool like Change Data Capture (CDC) to track changes in the table and extract the modified information.
Use a query to compare the current table with a backup copy taken the previous day to identify the modified information.
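A hedged sketch of the backup-comparison approach, assuming the current table and yesterday's snapshot are both available as Spark tables and order_id is the primary key (all names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("detect-modified-rows").getOrCreate()

current  = spark.table("sales.customer_orders")          # today's state (assumed table name)
snapshot = spark.table("sales.customer_orders_backup")   # yesterday's copy (assumed table name)

# Rows whose content differs from the snapshot (new or modified rows).
changed = current.exceptAll(snapshot)

# Keep only rows whose key already existed yesterday, i.e. the modified ones.
modified = changed.join(snapshot.select("order_id"), on="order_id", how="inner")
modified.show()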
Q3. How do you handle data pipeline when the schema information keeps changing at the source?
Handle changing schema by using schema evolution techniques and version control.
Use schema evolution techniques like adding new fields, renaming fields, and changing data types.
Implement version control to track changes and ensure backward compatibility.
Use tools like Apache Avro or Apache Parquet to store data in a self-describing format.
Implement automated testing to ensure data quality and consistency.
Collaborate with data producers to establish clear communication and documentation of schema changes; a schema-tolerant read/write sketch follows below.
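A small sketch of schema-tolerant reads and writes, assuming Parquet source files and a Delta Lake target with the Delta package on the classpath (paths are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-evolution").getOrCreate()

# Merge schemas across Parquet files that were written with different column sets.
df = spark.read.option("mergeSchema", "true").parquet("/data/raw/events/")

# When appending to a Delta table, allow new source columns to be added automatically.
(df.write
   .format("delta")
   .mode("append")
   .option("mergeSchema", "true")
   .save("/data/curated/events/"))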
Q4. Difference between Parquet and ORC files. Why does the industry use Parquet over ORC? Can schema evolution happen in ORC?
Parquet and ORC are both columnar storage formats. Parquet is preferred mainly for its cross-platform, multi-language ecosystem; ORC does support schema evolution.
Parquet is a columnar format with first-class support across Spark, Hive, Impala, Presto/Trino, Arrow, and most programming languages, which is why it dominates in mixed-technology stacks.
ORC is another columnar format that supports schema evolution, such as adding columns and limited type changes.
Both formats offer strong compression and predicate pushdown; ORC is often on par with or slightly better than Parquet on compression for Hive-centric workloads, so the industry preference for Parquet is driven more by ecosystem breadth than by compression alone.
ORC is primarily used within the Hive/Hadoop ecosystem, where its ACID support and built-in indexes (stripe-level statistics, bloom filters) are well integrated.
Q5. What is Normalisation and Denormalisation? When do we use them? Give a real-time example that is implemented in your project.
Normalisation is the process of organizing data in a database to reduce redundancy and improve data integrity. Denormalisation is the opposite process.
Normalisation is used to eliminate data redundancy and improve data integrity.
Denormalisation is used to improve query performance by reducing the number of joins required.
A real-time example of normalisation is breaking down a customer's information into separate tables such as customer details, order details, and payment details.
Q6. What are the different types of schema you know in Data Warehousing?
There are three types of schema in Data Warehousing: Star Schema, Snowflake Schema, and Fact Constellation Schema.
Star Schema: central fact table connected to dimension tables in a star shape
Snowflake Schema: extension of star schema with normalized dimension tables
Fact Constellation Schema: multiple fact tables connected to dimension tables in a complex structure
Q7. How many stages will be created if a spark job has 3 wide transformations and 2 narrow transformations?
There will be 4 stages created in total for the spark job.
Wide transformations trigger a shuffle and create a new stage.
Narrow transformations do not trigger a shuffle and do not create a new stage.
In this case, the 3 wide transformations introduce 3 shuffle boundaries, adding 3 stages beyond the initial stage, while the 2 narrow transformations are pipelined into existing stages.
Therefore, a total of 4 stages will be created.
Q8. What is the best approach to finding whether the data frame is empty or not?
For a pandas DataFrame, use df.empty (or check len(df) == 0), since the data is already in memory.
For a Spark DataFrame, len() is not defined; use df.isEmpty() (Spark 3.3+) or check df.head(1).
Avoid df.count() == 0 on large data, because it scans every partition just to answer a yes/no question.
Example: if len(df.head(1)) == 0: print('Data frame is empty') (a fuller sketch follows below)
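A short sketch covering both cases; the DataFrames are built inline so the snippet is self-contained:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("empty-check").getOrCreate()

# pandas: .empty (or len(df) == 0) is cheap because the data is already in memory.
pdf = pd.DataFrame({"id": []})
print(pdf.empty)                   # True

# Spark: avoid count() on large data; head(1)/isEmpty() stop after the first row.
sdf = spark.createDataFrame([], "id INT")
print(len(sdf.head(1)) == 0)       # True
print(sdf.isEmpty())               # True (available on Spark 3.3+)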
Q9. How do you calculate the resources based on the number of cores and memory given, e.g. 16 cores with 64 GB? What are overhead, driver memory, etc.?
Resource sizing starts from the cores and memory available per node, minus what the OS and cluster daemons need.
Reserve roughly 1 core and 1 GB of memory per node for the operating system and Hadoop/YARN daemons.
Use about 5 cores per executor for good HDFS throughput; on a 16-core node that leaves 15 usable cores, i.e. 3 executors per node.
Divide the remaining memory among the executors, then subtract the memory overhead, typically max(384 MB, 10% of executor memory).
Size the driver memory like a single executor unless it has to collect large results.
Example: for 16 cores with 64 GB memory, see the worked calculation below.
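A worked calculation for a 16-core, 64 GB node, following the common sizing heuristic (the numbers are illustrative, not a universal rule):

node_cores, node_mem_gb = 16, 64

os_cores, os_mem_gb = 1, 1                       # reserved for OS / daemons
usable_cores = node_cores - os_cores             # 15
usable_mem_gb = node_mem_gb - os_mem_gb          # 63

cores_per_executor = 5                           # good HDFS throughput
executors_per_node = usable_cores // cores_per_executor      # 3

mem_per_executor = usable_mem_gb / executors_per_node        # 21 GB
overhead_gb = max(0.384, 0.10 * mem_per_executor)            # ~2.1 GB memory overhead
executor_memory_gb = mem_per_executor - overhead_gb          # ~19 GB for spark.executor.memory

driver_memory_gb = executor_memory_gb            # driver sized like one executor; in cluster mode
                                                 # one executor slot is often given up for the driver
print(executors_per_node, round(executor_memory_gb, 1), round(overhead_gb, 1))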
Q10. Big data Hadoop architecture and HDFS commands to copy and list files in hdfs spark architecture and Transformation and Action question what happen when we submit spark program spark dataframe coding question s...
Questions on big data, Hadoop, Spark, Scala, Git, the project, and Agile.
Hadoop architecture and HDFS commands for copying and listing files in HDFS
Spark architecture and Transformation and Action question
What happens when we submit a Spark program
Spark DataFrame coding question
Scala basic program on List
Git and GitHub
Project-related questions
Agile-related questions
Q11. What is the difference between tasks and stages? What about the Spark UI?
Tasks and stages are components of the execution plan in Spark UI.
Tasks are the smallest unit of work in Spark, representing a single operation on a partition of data.
Stages are groups of tasks that are executed together as part of a larger computation.
Tasks within a stage can be executed in parallel, while stages are executed sequentially.
Tasks are created based on the transformations and actions in the Spark application.
Stages are created based on the dependencies between RDDs; a new stage begins at every shuffle (wide) dependency.
Q12. how would you pass connections string if lambda is connecting to database
Pass connection string as environment variable or use AWS Secrets Manager
Store connection string as environment variable in Lambda function configuration
Retrieve connection string from AWS Secrets Manager and use it in Lambda function
Use IAM role to grant Lambda function access to database
Encrypt connection string using AWS KMS for added security
Q13. how to migrate 1000s of tables using spark(databricks) notebooks
Use Spark (Databricks) notebooks to migrate 1000s of tables efficiently.
Utilize Spark's parallel processing capabilities to handle large volumes of data
Leverage Databricks notebooks for interactive data exploration and transformation
Automate the migration process using scripts or workflows
Optimize performance by tuning Spark configurations and cluster settings
Q14. What is the difference between repartition and coalesce?
Repartition increases or decreases the number of partitions in a DataFrame, while Coalesce only decreases the number of partitions.
Repartition can increase or decrease the number of partitions in a DataFrame, leading to a shuffle of data across the cluster.
Coalesce only decreases the number of partitions in a DataFrame without performing a full shuffle, making it more efficient than repartition.
Repartition is typically used when there is a need to increase the number of partitions or to rebalance skewed data; coalesce is typically used to reduce the number of partitions, for example before writing output (see the sketch below).
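A small PySpark sketch showing the difference (the data and partition counts are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-vs-coalesce").getOrCreate()

df = spark.range(1_000_000)
print(df.rdd.getNumPartitions())     # default parallelism

wide = df.repartition(200)           # full shuffle; can increase or decrease partitions
narrow = df.coalesce(4)              # no full shuffle; can only decrease partitions

print(wide.rdd.getNumPartitions())   # 200
print(narrow.rdd.getNumPartitions()) # 4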
Q15. Python coding question (without built-in methods): 1. Check if a list is sorted. 2. Sort the list and optimise the solution.
Check if a list is sorted and sort the list without using Python methods.
To check if a list is sorted, iterate through the list and compare each element with the next one. If any element is greater than the next one, the list is not sorted.
To sort the list without using Python methods, implement a sorting algorithm like bubble sort, selection sort, or insertion sort.
Example for checking if a list is sorted: ['a', 'b', 'c'] is sorted, ['c', 'b', 'a'] is not sorted.
Example for sorting: see the insertion-sort sketch below.
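A plain-Python sketch of both parts, avoiding sorted() and list.sort():

def is_sorted(items):
    # Compare each element with its successor; one pass, O(n).
    for i in range(len(items) - 1):
        if items[i] > items[i + 1]:
            return False
    return True

def insertion_sort(items):
    # In-place insertion sort: O(n^2) worst case, but close to O(n) when nearly sorted.
    for i in range(1, len(items)):
        key = items[i]
        j = i - 1
        while j >= 0 and items[j] > key:
            items[j + 1] = items[j]
            j -= 1
        items[j + 1] = key
    return items

print(is_sorted([1, 2, 3]), is_sorted([3, 1, 2]))   # True False
print(insertion_sort([5, 2, 4, 1]))                 # [1, 2, 4, 5]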
Q16. What are optimisation techniques used in the project?
Optimisation techniques used in the project include indexing, query optimization, caching, and parallel processing.
Indexing: Creating indexes on frequently queried columns to improve search performance.
Query optimization: Rewriting queries to make them more efficient and reduce execution time.
Caching: Storing frequently accessed data in memory to reduce the need for repeated database queries.
Parallel processing: Distributing tasks across multiple processors to speed up data processing.
Q17. what are the optimization techniques used in your project
Optimization techniques used in project
Caching
Parallel processing
Compression
Indexing
Query optimization
Q18. Which AWS services were used, and what was the AWS architecture for those services?
AWS services used include S3, Redshift, Glue, EMR, and Lambda in a scalable and cost-effective architecture.
AWS S3 for storing large amounts of data
AWS Redshift for data warehousing and analytics
AWS Glue for ETL processes
AWS EMR for big data processing
AWS Lambda for serverless computing
Q19. How does query acceleration speed up query processing?
Query acceleration speeds up query processing by optimizing query execution and reducing the time taken to retrieve data.
Query acceleration uses techniques like indexing, partitioning, and caching to optimize query execution.
It reduces the time taken to retrieve data by minimizing disk I/O and utilizing in-memory processing.
Examples include using columnar storage formats like Parquet or optimizing join operations.
Q20. Difference between Broadcast variable and accumulator variable
Broadcast variables are read-only variables cached on each executor, while accumulators are variables that tasks can only add to and only the driver can read.
Broadcast variables give every node a copy of a large lookup table or other reference dataset without shipping it with every task.
Accumulator variables keep a running total (counters, sums) across tasks; updates made on the executors are merged on the driver.
Broadcast variables serve read-only lookups, while accumulators serve write-only aggregation from the tasks' perspective (see the sketch below).
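A short PySpark sketch of both; the lookup data and validation logic are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-vs-accumulator").getOrCreate()
sc = spark.sparkContext

country_lookup = sc.broadcast({"IN": "India", "US": "United States"})  # read-only copy per executor
bad_records = sc.accumulator(0)                                        # tasks only add; driver reads

def resolve(code):
    # Look up the country name; count unknown codes via the accumulator.
    if code not in country_lookup.value:
        bad_records.add(1)
        return "UNKNOWN"
    return country_lookup.value[code]

names = sc.parallelize(["IN", "US", "XX"]).map(resolve).collect()
print(names, bad_records.value)   # ['India', 'United States', 'UNKNOWN'] 1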
Q21. 1. Different Types of integration runtime in adf 2. How to copy 100 files from one adls path to another in adf 3. Diff between DAG and Lineage , narrow and wide transformation in Spark 4. DBUtils questions. 5. ...
The interview questions cover topics related to Azure Data Factory, Spark, and Python programming.
Integration runtimes in ADF include Azure, Self-hosted, and SSIS IRs.
To copy 100 files in ADF, use a Copy Data activity with a wildcard path in source and sink datasets.
DAG in Spark represents a directed acyclic graph of computation, while lineage tracks the data flow.
Narrow transformations in Spark operate on a single partition, while wide transformations shuffle data across partitions.
Q22. Two SQL coding questions and two Python coding questions, such as reversing a string.
Reverse a string using SQL and Python codes.
In SQL, use the REVERSE function to reverse a string.
In Python, use slicing with a step of -1 to reverse a string.
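A quick sketch of both variants; the SQL is run through spark.sql so the example stays in one language:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reverse-string").getOrCreate()

# Python: slicing with a step of -1.
print("data engineer"[::-1])

# SQL: most engines (including Spark SQL) expose a REVERSE function.
spark.sql("SELECT REVERSE('data engineer') AS reversed").show()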
Q23. What is the SQL query to group by employee ID in order to combine the first name and last name with a space?
SQL query to group by employee ID and combine first name and last name with a space
Use the GROUP BY clause to group by employee ID
Use the CONCAT function to combine first name and last name with a space
Select employee ID, CONCAT(first_name, ' ', last_name) AS full_name
Q24. How do you decide on cores and worker nodes?
Cores and worker nodes are decided based on the workload requirements and scalability needs of the data processing system.
Consider the size and complexity of the data being processed
Evaluate the processing speed and memory requirements of the tasks
Take into account the parallelism and concurrency needed for efficient data processing
Monitor the system performance and adjust cores and worker nodes as needed
Q25. SQL query to find the 2nd most order item in a category
Use a SQL query with a subquery to find the 2nd most ordered item in a category.
Use a subquery to rank items within each category based on the number of orders
Select the item with rank 2 within each category
Order the results by category and rank to get the 2nd most ordered item in each category
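A hedged Spark SQL sketch, assuming a hypothetical order_items(category, item) view where each row is one ordered item:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("second-most-ordered").getOrCreate()

second_most_sql = """
WITH item_orders AS (
    SELECT category, item, COUNT(*) AS order_count
    FROM order_items
    GROUP BY category, item
), ranked AS (
    SELECT *,
           DENSE_RANK() OVER (PARTITION BY category ORDER BY order_count DESC) AS rnk
    FROM item_orders
)
SELECT category, item, order_count
FROM ranked
WHERE rnk = 2
ORDER BY category
"""
spark.sql(second_most_sql).show()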
Q26. Find top 5 countries with highest population in Spark and SQL
Use Spark and SQL to find the top 5 countries with the highest population.
Use Spark to load the data and perform data processing.
Use SQL queries to group by country and sum the population.
Order the results in descending order and limit to top 5.
Example: SELECT country, SUM(population) AS total_population FROM table_name GROUP BY country ORDER BY total_population DESC LIMIT 5
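The same result with the DataFrame API, assuming an existing countries table with country and population columns (names are illustrative):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("top-countries").getOrCreate()

countries = spark.table("countries")   # assumed table name

(countries
    .groupBy("country")
    .agg(F.sum("population").alias("total_population"))
    .orderBy(F.col("total_population").desc())
    .limit(5)
    .show())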
Q27. What is the Catalyst optimiser? How does it work?
A catalyst optimizer is a query optimization tool used in Apache Spark to improve performance by generating an optimal query plan.
Catalyst is Spark SQL's extensible query optimization framework, combining rule-based and cost-based optimization.
It leverages rules to transform the logical query plan into a more optimized physical plan.
The optimizer applies various optimization techniques like predicate pushdown, constant folding, and join reordering.
By optimizing the query plan, it reduces the overall execution time without requiring the developer to hand-tune each query.
Q28. What is shuffling? How to Handle Shuffling?
Shuffling is the process of redistributing data across partitions in a distributed computing environment.
Shuffling is necessary when data needs to be grouped or aggregated across different partitions.
It can be handled efficiently by minimizing the amount of data being shuffled and optimizing the partitioning strategy.
In Spark, shuffling can be reduced with broadcast joins, pre-partitioning or bucketing on the join/grouping key, and map-side aggregation (e.g. reduceByKey instead of groupByKey); in MapReduce, combiners play a similar role.
Q29. What are the different joins in SQL? Please give an example to elaborate.
Different types of joins in SQL include inner join, left join, right join, and full outer join.
Inner join: Returns rows when there is a match in both tables.
Left join: Returns all rows from the left table and the matched rows from the right table.
Right join: Returns all rows from the right table and the matched rows from the left table.
Full outer join: Returns all rows from both tables, with NULLs filled in where there is no match.
Example: SELECT * FROM table1 INNER JOIN table2 ON table1.id = table2.id;
Q30. Write a spark program to find the word with maximum length in a given string
Use Spark program to find word with maximum length in a given string
Split the string into words using space as delimiter
Map each word to its length
Find the word with maximum length using reduce operation
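A compact PySpark sketch; the input string is illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("longest-word").getOrCreate()
sc = spark.sparkContext

text = "Spark makes distributed computation straightforward"

# Split on spaces, then keep the longer word at each reduce step.
longest = (sc.parallelize(text.split(" "))
             .reduce(lambda a, b: a if len(a) >= len(b) else b))
print(longest)   # straightforward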
Q31. Write an SQL query to change rows into columns and vice versa
Use the SQL PIVOT and UNPIVOT operators (or conditional aggregation with CASE WHEN where they are not supported) to change rows into columns and vice versa.
Use the PIVOT function in SQL to transform rows into columns
Use the UNPIVOT function in SQL to transform columns into rows
Example: SELECT * FROM table_name PIVOT (SUM(value) FOR column_name IN (value1, value2, value3))
Example: SELECT * FROM table_name UNPIVOT (value FOR column_name IN (value1, value2, value3))
Q32. Write an SQL query to fetch duplicate rows in a table
SQL query to fetch duplicate rows in a table
Use GROUP BY and HAVING clause to identify duplicate rows
Select columns to check for duplicates
Example: SELECT column1, column2, COUNT(*) FROM table_name GROUP BY column1, column2 HAVING COUNT(*) > 1;
Q33. What do you know about Spark architecture?
Spark architecture is based on a master-slave architecture with a cluster manager to coordinate tasks.
Spark architecture consists of a driver program that communicates with a cluster manager to coordinate tasks.
The cluster manager allocates resources and schedules tasks on worker nodes.
Worker nodes execute the tasks and return results to the driver program.
Spark supports various cluster managers like YARN, Mesos, and standalone mode.
Spark applications can run in standalone mode or on a cluster manager such as YARN, Mesos, or Kubernetes.
Q34. Find the duplicate items in a list
To find duplicate items in a list
Iterate through the list and compare each item with the rest of the list
Use a hash table to keep track of seen items
Sort the list and compare adjacent items
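A minimal sketch of the hash-table approach in plain Python:

def find_duplicates(items):
    seen, duplicates = set(), set()
    for item in items:            # single pass: O(n) time, O(n) extra space
        if item in seen:
            duplicates.add(item)
        else:
            seen.add(item)
    return list(duplicates)

print(find_duplicates([1, 2, 3, 2, 4, 1]))   # [1, 2] (order may vary)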
Q35. How do you handle Incremental data?
Incremental data is handled by identifying new data since the last update and merging it with existing data.
Identify new data since last update
Merge new data with existing data
Update data warehouse or database with incremental changes
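A hedged sketch of the merge step using Delta Lake's MERGE, with assumed table and column names; the incremental rows are expected to have been filtered upstream (e.g. on a watermark column):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-load").getOrCreate()

spark.sql("""
    MERGE INTO warehouse.customers AS target
    USING staging.customers_increment AS source
    ON target.customer_id = source.customer_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")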
Q36. Using two tables find the different records for different joins
To find different records for different joins using two tables
Use the SQL query to perform different joins like INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL JOIN
Identify the key columns in both tables to join on
Select the columns from both tables and use a WHERE key IS NULL filter on a left or right join (an anti-join) to isolate the records that exist in one table but not the other.
Q37. What happens when we enforce schema ?
Enforcing schema ensures that data conforms to a predefined structure and rules.
Ensures data integrity by validating incoming data against predefined schema
Helps in maintaining consistency and accuracy of data
Prevents data corruption and errors in data processing
Can lead to rejection of data that does not adhere to the schema
Q38. What are the common file formats used in data storages? Which one is best for compression?
Common file formats used in data storage include CSV, JSON, Parquet, Avro, and ORC. Columnar formats such as Parquet and ORC compress best.
CSV (Comma-Separated Values) - simple and widely used, but not efficient for large datasets
JSON (JavaScript Object Notation) - human-readable and easy to parse, but can be inefficient for storage
Parquet - columnar storage format that is highly efficient for compression and query performance
Avro - efficient row-based binary format with schema support, good for data serialization and streaming pipelines
ORC - columnar format with strong compression and built-in indexes, common in Hive environments
Q39. End to End project Architecture and data pipeline working
End to end project architecture involves designing and implementing a data pipeline to process and analyze data from various sources.
Define project requirements and goals
Design data architecture including data sources, storage, processing, and analytics tools
Implement data pipeline to extract, transform, and load data
Ensure data quality and consistency throughout the pipeline
Monitor and optimize performance of the data pipeline
Examples: Using Apache Kafka for real-time data streaming into the pipeline.
Q40. SQL: What are the conditions used in SQL? When we have a table but we want to create ...
SQL conditions are used to filter data based on specified criteria. Common conditions include WHERE, AND, OR, IN, BETWEEN, etc.
Common SQL conditions include WHERE, AND, OR, IN, BETWEEN, LIKE, etc.
Conditions are used to filter data based on specified criteria in SQL queries.
Examples: WHERE salary > 50000 AND department = 'IT', WHERE age < 30
Q41. Tell me about yourself; Projects done; What is Columnar format file in Spark; Internals of Spark, Difference between OLAP and OLTP; About Datawarehouse- facts, dimensions
I am a Senior Data Engineer with experience in various projects involving columnar format files in Spark, understanding Spark internals, OLAP vs OLTP, and data warehousing concepts.
Projects: Developed ETL pipelines using Spark for processing large datasets, implemented data quality checks, and optimized query performance.
Columnar format file in Spark: It stores data in columnar format to improve query performance by reading only the required columns, like Parquet or ORC files...
Q42. What are some of the analytical functions available in SQL?
Analytical (window) functions in SQL perform calculations across a set of rows related to the current row, using an OVER clause.
Aggregate functions such as SUM, AVG, COUNT, MIN, MAX can be used as window functions with OVER (PARTITION BY ...).
Window functions like ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD
Distribution and ranking functions like NTILE, PERCENT_RANK, CUME_DIST
Statistical functions like STDDEV, VARIANCE
Example: SUM(sales) OVER (PARTITION BY region ORDER BY order_month) produces a running total per region.
Q43. How to write a file in a delta table?
To write a file in a delta table, you can use the Delta Lake API or Spark SQL commands.
Use Delta Lake API to write data to a delta table
Use Spark SQL commands like INSERT INTO to write data to a delta table
Ensure that the data being written is in the correct format and schema
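A brief sketch of both routes, assuming the Delta Lake library is configured on the cluster (paths and table names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-write").getOrCreate()

df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "label"])

# DataFrame writer: append the data to a Delta table stored at a path.
df.write.format("delta").mode("append").save("/data/delta/labels")

# Spark SQL: insert into an already-registered Delta table.
df.createOrReplaceTempView("new_labels")
spark.sql("INSERT INTO catalog_db.labels SELECT * FROM new_labels")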
Q44. What is difference between dynamic data frame and spark data frame in aws glue job? can we change dynamic dataframe to spark datafrmae?
A DynamicFrame is AWS Glue's schema-flexible abstraction over a Spark DataFrame, while a Spark DataFrame is created directly through the Spark APIs and requires a fixed schema.
Each DynamicFrame record is self-describing, so it tolerates inconsistent or evolving schemas without the full schema being known up front.
DynamicFrames integrate with Glue features such as crawlers and built-in transforms, but for complex transformations the Spark DataFrame API is often more convenient and can perform better.
Yes, a DynamicFrame can be converted to a Spark DataFrame with its toDF() method, and converted back with DynamicFrame.fromDF(); a sketch follows below.
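A small Glue-job sketch of the conversion, assuming a Glue job context and an existing catalog table (database, table, and column names are illustrative):

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

sc = SparkContext()
glue_context = GlueContext(sc)

# Read a DynamicFrame from the Glue Data Catalog (assumed database/table names).
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders")

# DynamicFrame -> Spark DataFrame for richer transformations.
df = dyf.toDF().filter("order_value > 100")

# Spark DataFrame -> DynamicFrame to keep using Glue sinks and transforms.
dyf_back = DynamicFrame.fromDF(df, glue_context, "filtered_orders")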
Q45. Do you have experience in Dataflow, Dataproc, cloud composer?
Yes, I have experience in Dataflow, Dataproc, and cloud composer.
I have worked with Dataflow to process and analyze large datasets in real-time.
I have used Dataproc to create and manage Apache Spark and Hadoop clusters for big data processing.
I have experience with cloud composer for orchestrating workflows and managing data pipelines.
Q46. How can you optimize your queries for efficiency in BQ?
Optimizing queries in BigQuery involves using partitioned tables, clustering, and optimizing joins.
Partition tables by date or another relevant column to reduce the amount of data scanned
Use clustering to group related rows together, reducing the amount of data scanned for queries
Avoid unnecessary joins and denormalize data where possible to reduce query complexity
Q47. How to implemented Primary key and foreign key in delta tables.
In Delta tables (Databricks/Unity Catalog), primary key and foreign key constraints are declared with ALTER TABLE ... ADD CONSTRAINT, but they are informational only and are not enforced by the engine.
A PRIMARY KEY declaration documents the intended unique identifier of each record; uniqueness itself has to be guaranteed by the pipeline, for example with MERGE or deduplication logic.
A FOREIGN KEY declaration documents the link between two tables based on a common column, which helps optimisers and downstream tools understand the data model.
The referenced table must have a declared primary key, and keeping the data consistent with these declared constraints remains the pipeline's responsibility.
Q48. After performing joins how many records would be retrieved for inner, left, right and outer joins
The number of records retrieved after performing joins depends on the type of join - inner, left, right, or outer.
Inner join retrieves only the matching records from both tables
Left join retrieves all records from the left table and matching records from the right table
Right join retrieves all records from the right table and matching records from the left table
Outer join retrieves all records from both tables, filling in NULL values for non-matching records
Q49. Write a SQL query to select data from table 2 where data exists in table 1
Use a SQL query to select data from table 2 where data exists in table 1
Use a JOIN statement to link the two tables based on a common column
Specify the columns you want to select from table 2
Use a WHERE clause to check for existence of data in table 1
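A hedged Spark SQL sketch with assumed table names table1/table2 and a shared id column:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("exists-filter").getOrCreate()

spark.sql("""
    SELECT t2.*
    FROM table2 t2
    WHERE EXISTS (
        SELECT 1
        FROM table1 t1
        WHERE t1.id = t2.id
    )
""").show()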
Q50. How to handle missing data in pyspark dataframe.
Handle missing data in a PySpark DataFrame with functions such as dropna, fillna, or imputation logic.
Use dropna() to remove rows that contain nulls, optionally restricted to specific columns via the subset argument.
Use fillna() to fill nulls with a default value, per column if needed.
Use coalesce()/when-otherwise expressions (or pyspark.ml.feature.Imputer) to replace nulls with derived values such as a mean or median; see the sketch below.
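A short PySpark sketch of the main options; the column names and sample data are illustrative:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("missing-data").getOrCreate()

df = spark.createDataFrame(
    [(1, "IN", 10.0), (2, None, None), (3, "US", 5.0)],
    ["id", "country", "amount"])

dropped = df.dropna(subset=["country"])                   # drop rows with a null country
filled = df.fillna({"country": "UNKNOWN", "amount": 0})   # per-column default values

# Impute with a derived value (here the column mean) instead of a constant.
mean_amount = df.agg(F.avg("amount")).first()[0]
imputed = df.withColumn("amount", F.coalesce(F.col("amount"), F.lit(mean_amount)))

imputed.show()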