Senior Data Engineer


200+ Senior Data Engineer Interview Questions and Answers

Updated 1 Mar 2025

Q51. What is your understanding of dbt (data build tool) and its applications in data transformation?

Ans.

dbt is a data build tool used for transforming data in the data warehouse by writing SQL queries and managing dependencies.

  • dbt is an open-source tool that allows data engineers to transform data in the data warehouse using SQL queries.

  • It helps in managing dependencies between different SQL queries and ensures that the data transformation process is efficient and reliable.

  • dbt can be used to create reusable SQL models, run tests on data quality, and document the data transformation process.

Q52. What are some of the analytical functions available in SQL?

Ans.

Analytical (window) functions in SQL perform calculations across a set of rows related to the current row, typically via an OVER clause.

  • Aggregate functions like SUM, AVG, COUNT, MIN, MAX (analytical when paired with an OVER clause)

  • Window functions like ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD

  • Ranking functions like NTILE, PERCENT_RANK, CUME_DIST

  • Statistical functions like STDDEV, VARIANCE

  • String functions like CONCAT, SUBSTRING, TRIM (scalar helpers often used alongside analytical queries)
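
A minimal PySpark sketch (toy data; the column names are made up) showing several of these functions over a per-department window:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("window-demo").getOrCreate()
    df = spark.createDataFrame(
        [("sales", "amit", 100), ("sales", "bela", 200), ("hr", "chen", 150)],
        ["dept", "name", "salary"],
    )

    w = Window.partitionBy("dept").orderBy(F.col("salary").desc())
    df.select(
        "dept", "name", "salary",
        F.row_number().over(w).alias("row_num"),        # ROW_NUMBER
        F.rank().over(w).alias("rnk"),                  # RANK
        F.lag("salary", 1).over(w).alias("prev_sal"),   # LAG
        # an aggregate becomes analytical when paired with OVER
        F.sum("salary").over(Window.partitionBy("dept")).alias("dept_total"),
    ).show()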

Q53. What is the difference between a dynamic data frame and a Spark data frame in an AWS Glue job? Can we convert a dynamic data frame to a Spark data frame?

Ans.

Dynamic data frame in AWS Glue job is a dynamically generated data frame, while Spark data frame is specifically created using Spark APIs.

  • Dynamic data frame is generated dynamically at runtime based on the data source and schema, while Spark data frame is explicitly created using Spark APIs.

  • Dynamic data frame is more flexible but may have performance implications compared to Spark data frame.

  • You can convert a dynamic data frame to a Spark data frame using its toDF() method, as sketched below.
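
A minimal AWS Glue sketch of the conversion in both directions; the catalog database, table, and column names are hypothetical:

    from awsglue.context import GlueContext
    from awsglue.dynamicframe import DynamicFrame
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    # DynamicFrames infer a flexible, self-describing schema at runtime
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database="my_db", table_name="my_table"
    )

    df = dyf.toDF()                        # DynamicFrame -> Spark DataFrame
    cleaned = df.where(df["amount"] > 0)   # plain Spark API from here on
    back = DynamicFrame.fromDF(cleaned, glue_context, "cleaned")  # and back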

Q54. How to write a file in a delta table?

Ans.

To write a file in a delta table, you can use the Delta Lake API or Spark SQL commands.

  • Use Delta Lake API to write data to a delta table

  • Use Spark SQL commands like INSERT INTO to write data to a delta table

  • Ensure that the data being written is in the correct format and schema
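
A minimal PySpark sketch of both routes, assuming a session configured with the delta-spark package; the path and schema are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("delta-write").getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

    # Route 1: write (append) to a Delta table by path
    df.write.format("delta").mode("append").save("/tmp/delta/events")

    # Route 2: register the data and insert with Spark SQL
    df.createOrReplaceTempView("staged")
    spark.sql("CREATE TABLE IF NOT EXISTS events USING DELTA LOCATION '/tmp/delta/events'")
    spark.sql("INSERT INTO events SELECT * FROM staged")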


Q55. Do you have experience in Dataflow, Dataproc, and Cloud Composer?

Ans.

Yes, I have experience in Dataflow, Dataproc, and cloud composer.

  • I have worked with Dataflow to process and analyze large datasets in real-time.

  • I have used Dataproc to create and manage Apache Spark and Hadoop clusters for big data processing.

  • I have experience with cloud composer for orchestrating workflows and managing data pipelines.

Q56. How can you optimize your queries for efficiency in BQ?

Ans.

Optimizing queries in BigQuery involves using partitioned tables, clustering, and optimizing joins.

  • Partition tables by date or another relevant column to reduce the amount of data scanned

  • Use clustering to group related rows together, reducing the amount of data scanned for queries

  • Avoid unnecessary joins and denormalize data where possible to reduce query complexity
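
A hedged sketch of these ideas as BigQuery DDL, submitted through the google-cloud-bigquery client; the dataset, table, and column names are hypothetical:

    from google.cloud import bigquery

    client = bigquery.Client()  # assumes GCP credentials are configured
    client.query("""
        CREATE TABLE my_dataset.events
        PARTITION BY DATE(event_ts)    -- scan only the partitions a query needs
        CLUSTER BY user_id, country    -- co-locate related rows within partitions
        AS SELECT * FROM my_dataset.raw_events
    """).result()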


Q57. How do you utilize the enhanced optimization option in AWS Glue?

Ans.

Enhanced optimization in AWS Glue improves job performance by automatically adjusting resources based on workload.

  • Enhanced optimization in AWS Glue automatically adjusts resources like DPUs based on workload

  • It helps improve job performance by optimizing resource allocation

  • Users can enable enhanced optimization in AWS Glue job settings

Q58. What are the best practices for optimizing querying in Amazon Redshift?

Ans.

Optimizing querying in Amazon Redshift involves proper table design, distribution keys, sort keys, and query optimization techniques.

  • Use appropriate distribution keys to evenly distribute data across nodes for parallel processing.

  • Utilize sort keys to physically order data on disk, reducing the need for sorting during queries.

  • Avoid using SELECT * and instead specify only the columns needed to reduce data transfer.

  • Use the ANALYZE command to update statistics so the query planner can generate efficient plans.
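
A hedged sketch of distribution and sort keys in Redshift DDL, sent through a standard Postgres driver; the table, columns, and connection details are all hypothetical:

    import psycopg2  # Redshift speaks the Postgres wire protocol

    conn = psycopg2.connect(host="my-cluster.example.com", port=5439,
                            dbname="dev", user="awsuser", password="...")
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE sales (
            sale_id     BIGINT,
            customer_id BIGINT,
            sale_date   DATE,
            amount      DECIMAL(12,2)
        )
        DISTKEY (customer_id)   -- co-locate rows joined on customer_id
        SORTKEY (sale_date);    -- prune blocks for date-range filters
    """)
    cur.execute("ANALYZE sales;")  # refresh statistics for the query planner
    conn.commit()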


Q59. After performing joins, how many records would be retrieved for inner, left, right, and outer joins?

Ans.

The number of records retrieved after performing joins depends on the type of join - inner, left, right, or outer.

  • Inner join retrieves only the matching records from both tables

  • Left join retrieves all records from the left table and matching records from the right table

  • Right join retrieves all records from the right table and matching records from the left table

  • Outer join retrieves all records from both tables, filling in NULL values for non-matching records
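
A minimal PySpark sketch with toy data that makes the counts concrete:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    left = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "l"])
    right = spark.createDataFrame([(2, "x"), (3, "y"), (4, "z")], ["id", "r"])

    for how in ["inner", "left", "right", "outer"]:
        print(how, left.join(right, "id", how).count())
    # inner -> 2, left -> 3, right -> 3, outer -> 4 for this data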

Q60. How to implement primary key and foreign key constraints in Delta tables?

Ans.

Primary keys and foreign keys can be implemented in delta tables using constraints and references.

  • Primary keys can be implemented using the PRIMARY KEY constraint, which ensures that each record in the table has a unique identifier.

  • Foreign keys can be implemented using the FOREIGN KEY constraint, which establishes a link between two tables based on a common column.

  • The referenced table must have a primary key defined, and the foreign key column in the referencing table must have values that match the referenced key.
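
A hedged Databricks SQL sketch with hypothetical table names, assuming a Unity Catalog environment; note that PRIMARY KEY and FOREIGN KEY on Delta tables there are informational constraints, recorded for tools and optimizers rather than enforced on write:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.sql("""
        CREATE TABLE dim_customer (
            customer_id BIGINT NOT NULL,
            name STRING,
            CONSTRAINT pk_customer PRIMARY KEY (customer_id)
        ) USING DELTA
    """)
    spark.sql("""
        CREATE TABLE fact_orders (
            order_id BIGINT NOT NULL,
            customer_id BIGINT,
            CONSTRAINT fk_customer FOREIGN KEY (customer_id)
                REFERENCES dim_customer (customer_id)
        ) USING DELTA
    """)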

Q61. Write a SQL query to select data from table 2 where data exists in table 1

Ans.

Use a SQL query to select data from table 2 where data exists in table 1

  • Use a JOIN statement to link the two tables based on a common column

  • Specify the columns you want to select from table 2

  • Use a WHERE clause to check for existence of data in table 1
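
A minimal Spark SQL sketch (hypothetical table and column names) showing two equivalent formulations:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # EXISTS-based version
    spark.sql("""
        SELECT t2.*
        FROM table2 t2
        WHERE EXISTS (SELECT 1 FROM table1 t1 WHERE t1.id = t2.id)
    """).show()

    # JOIN-based version (DISTINCT guards against duplicate matches)
    spark.sql("""
        SELECT t2.*
        FROM table2 t2
        INNER JOIN (SELECT DISTINCT id FROM table1) t1 ON t1.id = t2.id
    """).show()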

Q62. How to handle missing data in a PySpark dataframe?

Ans.

Handle missing data in pyspark dataframe by using functions like dropna, fillna, or replace.

  • Use dropna() function to remove rows with missing data

  • Use fillna() function to fill missing values with a specified value

  • Use replace() to substitute placeholder values (e.g., 'NA' or -1) that stand in for missing data
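
A minimal PySpark sketch of all three approaches on toy data:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, None), (None, "NA")], ["id", "val"])

    df.dropna().show()                             # drop rows containing any null
    df.fillna({"id": 0, "val": "unknown"}).show()  # fill nulls per column
    df.replace("NA", "unknown", "val").show()      # swap placeholder values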

Q63. How would you ensure that your ADF pipeline does not fail?

Ans.

To ensure ADF pipeline does not fail, monitor pipeline health, handle errors gracefully, optimize performance, and conduct regular testing.

  • Monitor pipeline health regularly to identify and address potential issues proactively

  • Handle errors gracefully by implementing error handling mechanisms such as retries, logging, and notifications

  • Optimize performance by tuning pipeline configurations, optimizing data processing logic, and utilizing appropriate resources

  • Conduct regular testing of pipeline components and failure scenarios to catch issues before deployment.

Q64. In Databricks, when a Spark job is submitted, what happens at the backend? Explain the flow.

Ans.

When a Spark job is submitted in Databricks, several backend processes are triggered to execute it.

  • The submitted spark job is divided into tasks by the Spark driver.

  • The tasks are then scheduled to run on the available worker nodes in the cluster.

  • The worker nodes execute the tasks and return the results to the driver.

  • The driver aggregates the results and presents them to the user.

  • Various optimizations such as data shuffling and caching may be applied during the execution process.

Q65. What are vertices and edges in a DAG?

Ans.

Vertices are nodes and edges are connections between nodes in a directed acyclic graph (DAG).

  • Vertices represent the tasks or operations in a DAG.

  • Edges represent the dependencies between tasks or operations.

  • Vertices can have multiple incoming edges and outgoing edges.

  • Edges can be weighted to represent the cost or time required to complete a task.

  • Examples of DAGs include data processing pipelines and task scheduling systems.

Q66. What is the process for finding the missing number from a list?

Ans.

To find the missing number from a list, calculate the sum of all numbers in the list and subtract it from the expected sum of the list.

  • Calculate the sum of all numbers in the list using a loop or a built-in function.

  • Calculate the expected sum of the list using the formula n*(n+1)/2, where n is the length of the list.

  • Subtract the sum of the list from the expected sum to find the missing number.
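
A minimal Python sketch of the approach:

    def find_missing(nums, n):
        """nums holds the integers 1..n with exactly one value missing."""
        expected = n * (n + 1) // 2   # sum of 1..n
        return expected - sum(nums)

    print(find_missing([1, 2, 4, 5], 5))  # -> 3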

Q67. Difference between repartition and coalesce, and between clustering and partitioning in BQ

Ans.

Repartition increases or decreases the number of partitions in a DataFrame, while coalesce only decreases the number of partitions. Clustering is used for organizing data within a partition, while partitioning is dividing data into logical units.

  • Repartition increases or decreases the number of partitions in a DataFrame, which can be useful for parallelism and performance optimization.

  • Coalesce only decreases the number of partitions in a DataFrame, which is more efficient than repartition because it avoids a full shuffle.
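
A minimal PySpark sketch of the repartition/coalesce half of the answer:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1_000_000)

    more = df.repartition(200)   # full shuffle; can increase the partition count
    fewer = more.coalesce(50)    # narrow merge; only decreases, avoids a full shuffle
    print(more.rdd.getNumPartitions(), fewer.rdd.getNumPartitions())  # 200 50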

Q68. Daily day to day activities in personal & professional life, Work life balance, Indian IT industries

Ans.

Maintaining work-life balance is crucial for personal and professional growth in Indian IT industries.

  • Prioritizing tasks and time management is important for a balanced routine.

  • Taking breaks and engaging in physical activities can help reduce stress and improve productivity.

  • Setting boundaries and communicating with colleagues and family members can help maintain a healthy work-life balance.

  • Indian IT industries are known for their demanding work culture, but companies are increasingly promoting work-life balance initiatives.

Q69. How would you delete duplicate records from a table?

Ans.

To delete duplicate records from a table, you can use the DELETE statement with a self-join or subquery.

  • Identify the duplicate records using a self-join or subquery

  • Use the DELETE statement to remove the duplicate records

  • Consider using a temporary table to store the unique records before deleting the duplicates
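
A runnable sqlite3 sketch of the idea on toy data (other engines express the same thing with a self-join or window function):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (name TEXT)")
    conn.executemany("INSERT INTO t VALUES (?)", [("a",), ("a",), ("b",)])

    # Keep the first physical row per value, delete the rest
    conn.execute("""
        DELETE FROM t
        WHERE rowid NOT IN (SELECT MIN(rowid) FROM t GROUP BY name)
    """)
    print(conn.execute("SELECT name FROM t").fetchall())  # [('a',), ('b',)]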

Q70. Given a list of words, write a Python program to print the most repeating substring across all words.

Ans.

Python program to find the most repeating substring in a list of words.

  • Iterate through each word in the list

  • Generate all possible substrings for each word

  • Count the occurrences of each substring using a dictionary

  • Find the substring with the highest count
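
A minimal Python sketch of the approach, counting substrings of length 2 or more (single characters would otherwise dominate):

    from collections import Counter

    def most_repeating_substring(words):
        counts = Counter()
        for word in words:
            for i in range(len(word)):
                for j in range(i + 2, len(word) + 1):  # substrings of length >= 2
                    counts[word[i:j]] += 1
        return counts.most_common(1)[0]

    print(most_repeating_substring(["data", "database", "dart"]))  # ('da', 4)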

Q71. What is the difference between Cache() and Persist()?

Ans.

Cache() and Persist() are both used for caching RDDs in Apache Spark, but Persist() allows for more customization.

  • Cache() is a shorthand for Persist(StorageLevel.MEMORY_ONLY)

  • Persist() allows for specifying different storage levels like MEMORY_ONLY, MEMORY_AND_DISK, etc.

  • Persist() also supports serialized storage levels (e.g., MEMORY_ONLY_SER); the serializer itself (Java or Kryo) is set via Spark configuration
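
A minimal PySpark sketch; note that for DataFrames cache() defaults to a memory-and-disk level rather than the RDD default of MEMORY_ONLY:

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1_000_000)

    df.cache()       # default storage level
    df.count()       # an action materializes the cache
    df.unpersist()

    df.persist(StorageLevel.MEMORY_AND_DISK)  # explicitly chosen level
    df.count()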

Q72. How does a DAG handle fault tolerance?

Ans.

DAGs handle fault tolerance by rerunning failed tasks and maintaining task dependencies.

  • DAGs rerun failed tasks automatically to ensure completion.

  • DAGs maintain task dependencies to ensure proper sequencing.

  • DAGs can be configured to retry failed tasks a certain number of times before marking them as failed.
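
A hedged sketch assuming the DAG in question is an Airflow 2.x DAG (names illustrative): retries and dependencies are declared directly on the DAG:

    from datetime import timedelta

    import pendulum
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="retry_demo",
        start_date=pendulum.datetime(2024, 1, 1),
        schedule=None,  # Airflow 2.4+ style
        default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
    ) as dag:
        extract = BashOperator(task_id="extract", bash_command="echo extract")
        load = BashOperator(task_id="load", bash_command="echo load")
        extract >> load  # load runs only after extract succeeds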

Q73. What is Databricks, and how does it differ from Azure Data Factory?

Ans.

Databricks is a unified analytics platform for big data and machine learning, while Azure Data Factory is a cloud-based data integration service.

  • Databricks is an integrated workspace for data engineering, data science, and machine learning tasks.

  • Azure Data Factory is a cloud-based data integration service that allows you to create, schedule, and manage data pipelines.

  • Databricks provides collaborative notebooks for data exploration and visualization, while Azure Data Factory focuses on orchestrating data movement and transformation pipelines.

Q74. Project experience and end-to-end data migration

Ans.

I have extensive experience in leading data migration projects from start to finish, ensuring seamless transition and minimal disruption.

  • Led a team in migrating legacy data from on-premise servers to cloud storage

  • Developed data mapping strategies to ensure accurate transfer of information

  • Implemented data validation processes to identify and rectify any discrepancies

  • Collaborated with stakeholders to define project scope and timelines

  • Utilized ETL tools such as Informatica and T…

Q75. When do we use SSIS packages? Difference between Union and Merge

Ans.

SSIS packages are used for ETL processes in SQL Server. Union All stacks datasets vertically, while Merge combines sorted inputs and Merge Join matches rows horizontally on join keys.

  • SSIS packages are used for Extract, Transform, Load (ETL) processes in SQL Server.

  • Union in SSIS combines datasets vertically, stacking rows on top of each other.

  • Merge in SSIS combines two sorted datasets into a single sorted output (row-wise, like Union All but requiring sorted inputs).

  • Union All in SSIS combines datasets vertically without removing duplicates.

  • Merge Join in SSIS combines datasets horizontally, joining sorted inputs on specified keys (inner, left outer, or full outer).

Q76. How do we create a duplicate table? What are window functions? What are the types of joins? Explain each join.

Ans.

To duplicate a table, use CREATE TABLE AS or INSERT INTO SELECT. Window functions are used for calculations across a set of table rows. Types of joins include INNER, LEFT, RIGHT, and FULL OUTER joins.

  • To duplicate a table, use CREATE TABLE AS or INSERT INTO SELECT

  • Window functions are used for calculations across a set of table rows

  • Types of joins include INNER, LEFT, RIGHT, and FULL OUTER joins

  • Explain each join: INNER - returns rows when there is at least one match in both tables…
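
A minimal Spark SQL sketch of the two table-duplication routes (hypothetical table name):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Create a new table with the same structure and data
    spark.sql("CREATE TABLE orders_copy AS SELECT * FROM orders")

    # Or, alternatively, copy rows into a table that already exists
    spark.sql("INSERT INTO orders_copy SELECT * FROM orders")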

Q77. SQL joins, number of records for each join

Ans.

SQL joins combine data from two or more tables based on a related column between them.

  • Different types of SQL joins include INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL JOIN.

  • The number of records in the result of a join depends on the type of join and the data in the tables.

  • For example, an INNER JOIN will only return records that have matching values in both tables, while a LEFT JOIN will return all records from the left table and matching records from the right table.

Q78. SQL query to delete Duplicates using CTEs

Ans.

Using Common Table Expressions (CTEs) to delete duplicates in SQL query

  • Use a CTE to identify duplicates by ranking rows based on a unique identifier

  • Delete rows with duplicate ranks in the main query
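
A runnable sqlite3 sketch (window functions need SQLite 3.25+; table and column names are hypothetical): the CTE ranks rows per key and the DELETE removes everything ranked above 1:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")
    conn.executemany("INSERT INTO customers (email) VALUES (?)",
                     [("a@x.com",), ("a@x.com",), ("b@x.com",)])

    conn.execute("""
        DELETE FROM customers WHERE id IN (
            WITH ranked AS (
                SELECT id,
                       ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn
                FROM customers
            )
            SELECT id FROM ranked WHERE rn > 1
        )
    """)
    print(conn.execute("SELECT email FROM customers").fetchall())
    # [('a@x.com',), ('b@x.com',)]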

Q79. Difference between BigQuery and Bigtable; what is partitioning in BQ?

Ans.

BigQuery is a fully managed, serverless data warehouse while BigTable is a NoSQL database for real-time analytics. Partitioning in BQ helps in organizing data for efficient querying.

  • BigQuery is a data warehouse used for analyzing large datasets using SQL queries.

  • BigTable is a NoSQL database used for real-time analytics and high-throughput applications.

  • Partitioning in BigQuery involves dividing large tables into smaller, manageable parts based on a specified column or field.

  • Partitioning reduces the amount of data scanned per query, improving performance and lowering cost…

Q80. How to read a text file as one column in PySpark and get the word counts

Ans.

Read a text file as one column in pyspark and get word counts

  • Use spark.read.text() to load the file as a DataFrame with a single 'value' column

  • Split the text column into words using split() function

  • Use groupBy() and count() functions to get the word counts
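
A minimal PySpark sketch (hypothetical file path):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    lines = spark.read.text("/tmp/input.txt")  # one column, named 'value'

    words = lines.select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
    counts = words.where(F.col("word") != "").groupBy("word").count()
    counts.orderBy(F.col("count").desc()).show()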

Q81. How do you migrate tables from Snowflake to BigQuery?

Ans.

Tables can be migrated from Snowflake to BigQuery using tools like Data Transfer Service or manually exporting/importing data.

  • Use Google Cloud Data Transfer Service to automate the migration process

  • Export data from Snowflake in a compatible format like CSV or JSON

  • Import the exported data into BigQuery using tools like Cloud Storage or Dataflow
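
A hedged sketch of the manual route with the snowflake-connector-python and google-cloud-bigquery clients; the stage, bucket, and table names are hypothetical, and the Snowflake stage is assumed to already point at GCS via a storage integration:

    import snowflake.connector
    from google.cloud import bigquery

    sf = snowflake.connector.connect(account="...", user="...", password="...")
    sf.cursor().execute("""
        COPY INTO @gcs_stage/orders/
        FROM my_db.public.orders
        FILE_FORMAT = (TYPE = CSV)
        HEADER = TRUE
    """)  # unload the Snowflake table to the external stage as CSV

    bq = bigquery.Client()
    bq.load_table_from_uri(
        "gs://my-bucket/orders/*.csv",  # the bucket the stage maps to
        "my_dataset.orders",
        job_config=bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            skip_leading_rows=1,
            autodetect=True,
        ),
    ).result()  # load the exported files into BigQuery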

Q82. How did you ingest Kafka data, and what are your day-to-day activities?

Ans.

I ingested Kafka data using Kafka Connect and performed data processing and analysis on a daily basis.

  • Used Kafka Connect to ingest data from various sources into Kafka topics

  • Developed custom Kafka Connect connectors for specific data sources

  • Utilized Kafka Streams for real-time data processing and analysis

  • Worked with schema registry for data serialization and deserialization

Q83. How to identify the root cause if an executor is taking longer than the others

Ans.

Identifying root cause of slow executor compared to others

  • Check resource utilization of the slow executor (CPU, memory, disk)

  • Look for any specific tasks or stages that are taking longer on the slow executor

  • Check for network latency or communication issues affecting the slow executor

  • Monitor garbage collection and JVM metrics for potential bottlenecks

  • Consider data skew or unbalanced data distribution causing slow performance
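
A minimal PySpark sketch of the data-skew check (input path and key column are hypothetical); a few disproportionately large keys usually explain a single straggling executor:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("/tmp/events")

    # If the top keys dwarf the rest, the partitions holding them will straggle
    df.groupBy("join_key").count().orderBy(F.col("count").desc()).show(10)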

Q84. Dataflow vs Dataproc; layering processing and curated environments in GCP; data cleaning

Ans.

Dataflow and Dataproc are both processing services in GCP, but with different approaches and use cases.

  • Dataflow is a fully managed service for executing batch and streaming data processing pipelines.

  • Dataproc is a managed Spark and Hadoop service for running big data processing and analytics workloads.

  • Dataflow provides a serverless and auto-scaling environment, while Dataproc offers more control and flexibility.

  • Dataflow is suitable for real-time streaming and complex data transformations…

Q85. How do you handle stakeholders' interests?

Ans.

Handle stakeholder's interests by understanding their needs, communicating effectively, and delivering results.

  • Understand the stakeholders' needs and priorities

  • Communicate regularly and effectively with stakeholders

  • Involve stakeholders in decision-making processes

  • Deliver results that align with stakeholders' interests

  • Manage expectations and address concerns promptly

Q86. What are the Types of SCD?

Ans.

Types of SCD include Type 1, Type 2, and Type 3.

  • Type 1 SCD: Overwrites old data with new data, no history is maintained.

  • Type 2 SCD: Maintains historical data by creating new records for changes.

  • Type 3 SCD: Creates separate columns to store historical and current data.

  • Examples: Type 1 - Employee address updates overwrite old address. Type 2 - Employee salary changes create new record with effective date. Type 3 - Employee job title history stored in separate columns.
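
A hedged Delta Lake sketch of a Type 2 change in two statements (hypothetical table and column names; a production version would also restrict the insert to changed or new keys):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Step 1: close out current rows whose tracked attribute changed
    spark.sql("""
        MERGE INTO dim_customer AS tgt
        USING updates AS src
        ON tgt.customer_id = src.customer_id AND tgt.is_current = TRUE
        WHEN MATCHED AND tgt.address <> src.address THEN
            UPDATE SET is_current = FALSE, end_date = current_date()
    """)

    # Step 2: insert the incoming versions as the new current rows
    # (column order: customer_id, address, is_current, start_date, end_date)
    spark.sql("""
        INSERT INTO dim_customer
        SELECT customer_id, address, TRUE, current_date(), NULL
        FROM updates
    """)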

Q87. How do you perform performance optimization in Spark?

Ans.

Performance optimization in Spark involves tuning configurations, optimizing code, and utilizing caching.

  • Tune Spark configurations such as executor memory, cores, and parallelism

  • Optimize code by reducing unnecessary shuffles, using efficient transformations, and avoiding unnecessary data movements

  • Utilize caching to store intermediate results in memory for faster access

Q88. How do you filter data from dashboard A to dashboard B?

Ans.

Use data connectors or APIs to extract and transfer data from one dashboard to another.

  • Utilize data connectors or APIs provided by the dashboard platforms to extract data from A dashboard.

  • Transform the data as needed to match the format of B dashboard.

  • Use data connectors or APIs of B dashboard to transfer the filtered data from A dashboard to B dashboard.

Q89. What are some methods for optimizing Spark performance?

Ans.

Optimizing Spark performance involves tuning configurations, partitioning data, caching, and using efficient transformations.

  • Tune Spark configurations for memory allocation, parallelism, and resource management.

  • Partition data properly to distribute work evenly across nodes and minimize shuffling.

  • Cache intermediate results in memory to avoid recomputation.

  • Use efficient transformations like map, filter, and reduceByKey instead of costly operations like groupByKey.

  • Opt for columnar file formats like Parquet to reduce I/O.
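
A minimal PySpark sketch combining two of these ideas: broadcasting the small side of a join and tuning shuffle partitions (paths and values are illustrative):

    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder
             .appName("perf")
             .config("spark.sql.shuffle.partitions", "200")  # match cluster parallelism
             .getOrCreate())

    facts = spark.read.parquet("/tmp/facts")
    dims = spark.read.parquet("/tmp/dims")

    # Broadcasting the small table avoids shuffling the large one
    joined = facts.join(F.broadcast(dims), "dim_id")
    joined.write.mode("overwrite").parquet("/tmp/out")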

Q90. Select the country from an address field separated by commas, where the position of the country varies across addresses

Ans.

Use string manipulation to extract country from address field separated by comma.

  • Split the address field by comma to get individual components

  • Check each component for country name using a list of countries or regular expressions

  • Return the country name once found
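
A minimal Python sketch; the country list is abbreviated and purely illustrative:

    COUNTRIES = {"india", "uk", "usa", "germany"}

    def extract_country(address):
        for part in address.split(","):
            if part.strip().lower() in COUNTRIES:
                return part.strip()
        return None

    print(extract_country("221B Baker Street, London, UK"))  # 'UK'
    print(extract_country("India, MG Road, Bengaluru"))      # 'India' (different position)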

Q91. Do you have hands-on experience with big data tools?

Ans.

Yes, I have hands-on experience with big data tools.

  • I have worked extensively with Hadoop, Spark, and Kafka.

  • I have experience with data ingestion, processing, and storage using these tools.

  • I have also worked with NoSQL databases like Cassandra and MongoDB.

  • I am familiar with data warehousing concepts and have worked with tools like Redshift and Snowflake.

Q92. Explain a batch data pipeline that you have built.

Ans.

Built a batch data pipeline to process and analyze customer transaction data.

  • Used Apache Spark for distributed data processing

  • Ingested data from various sources like databases and files

  • Performed data cleaning, transformation, and aggregation

  • Utilized SQL for querying and analyzing data

  • Generated reports and visualizations for stakeholders

Q93. Describe the SSO process between Snowflake and Azure Active Directory.

Ans.

SSO process between Snowflake and Azure Active Directory involves configuring SAML-based authentication.

  • Configure Snowflake to use SAML authentication with Azure AD as the identity provider

  • Set up a trust relationship between Snowflake and Azure AD

  • Users authenticate through Azure AD and are granted access to Snowflake resources

  • SSO eliminates the need for separate logins and passwords for Snowflake and Azure AD

Q94. How much data can be stored in MySQL database?

Ans.

The maximum amount of data that can be stored in a MySQL database depends on various factors.

  • The maximum size of a MySQL database is determined by the file system and operating system limitations.

  • The maximum size of a single table in MySQL is 64 terabytes (TB) for InnoDB storage engine and 256 terabytes (TB) for MyISAM storage engine.

  • The maximum number of rows in a table is determined by the maximum value of the AUTO_INCREMENT column.

  • The maximum size of a row in MySQL is 65,535 bytes.

Q95. Optimization performed in current project

Ans.

Implemented query optimization techniques to improve performance in data processing.

  • Utilized indexing to speed up data retrieval

  • Optimized SQL queries by rewriting them for better performance

  • Implemented caching mechanisms to reduce redundant data processing

Q96. Can you define the difference between Azure Data Lake and Delta Lake?

Ans.

Azure Data Lake is a cloud-based storage and analytics service, while Delta Lake is an open-source storage layer that adds reliability to data lakes.

  • Azure Data Lake is a service provided by Microsoft Azure for storing and analyzing large amounts of data.

  • Delta Lake is an open-source storage layer that adds ACID transactions and schema enforcement to data lakes.

  • Azure Data Lake is a cloud-based solution, while Delta Lake can be used on-premises or in the cloud.

  • Azure Data Lake supports large-scale analytics workloads…

Q97. SQL question: 1. Find the top 3 products for every producer. 2. Create a sample stored procedure.

Ans.

Use SQL to find the top 3 products for every producer and create a sample stored procedure.

  • Use a SQL query with a window function to rank products within each producer.

  • Partition the data by producer and order by product sales to rank the products.

  • Limit the results to the top 3 products for each producer.

  • Create a stored procedure that encapsulates the SQL logic for easy reuse.
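
A runnable sqlite3 sketch of the ranking query on toy data, with a SQL Server-style procedure shown as an illustrative string (SQLite itself has no stored procedures):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE products (producer TEXT, product TEXT, sales INT)")
    conn.executemany("INSERT INTO products VALUES (?, ?, ?)", [
        ("p1", "a", 50), ("p1", "b", 70), ("p1", "c", 20), ("p1", "d", 90),
        ("p2", "e", 30), ("p2", "f", 10),
    ])

    rows = conn.execute("""
        SELECT producer, product, sales FROM (
            SELECT *, ROW_NUMBER() OVER (
                PARTITION BY producer ORDER BY sales DESC
            ) AS rn
            FROM products
        ) WHERE rn <= 3
    """).fetchall()
    print(rows)  # top 3 products per producer

    PROC_SQL = """
    CREATE PROCEDURE dbo.TopProductsPerProducer AS
    BEGIN
        SELECT producer, product, sales
        FROM (SELECT *, ROW_NUMBER() OVER (
                  PARTITION BY producer ORDER BY sales DESC) AS rn
              FROM dbo.products) ranked
        WHERE rn <= 3;
    END
    """  # illustrative T-SQL wrapper around the same logic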

Q98. 1. Types of cloud 2. Different activities in ADF 3. Advanced SQL 4. Basic OOPs concepts

Ans.

Questions related to cloud types, ADF activities, advanced SQL, and basic OOPs concepts.

  • Types of cloud include public, private, and hybrid

  • ADF activities include data ingestion, transformation, and loading

  • Advanced SQL includes window functions, subqueries, and joins

  • Basic OOPs concepts include encapsulation, inheritance, and polymorphism

Q99. Time Travel, different types of tables in Snowflake and their retention periods

Ans.

Snowflake has different types of tables with varying retention periods. Time travel allows accessing historical data.

  • Snowflake has three table types: permanent, transient, and temporary

  • Transient and temporary tables support a Time Travel retention of at most 1 day

  • Permanent tables default to 1 day of Time Travel, extendable up to 90 days on Enterprise edition

  • Time Travel allows querying, cloning, or restoring historical data at earlier points in time

  • Time Travel is enabled by default for 1 day; for transient tables it cannot exceed 1 day
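
A hedged sketch querying one hour into the past with Time Travel through the snowflake-connector-python client (connection details and table name are hypothetical):

    import snowflake.connector

    conn = snowflake.connector.connect(account="...", user="...", password="...")
    cur = conn.cursor()

    # Read the table as it looked 3600 seconds ago
    cur.execute("SELECT * FROM orders AT (OFFSET => -3600)")
    print(cur.fetchmany(5))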

Q100. Configure Cluster for 100 TB data

Ans.

To configure a cluster for 100 TB data, consider factors like storage capacity, processing power, network bandwidth, and fault tolerance.

  • Choose a distributed storage system like HDFS or Amazon S3 for scalability and fault tolerance.

  • Select high-capacity servers with sufficient RAM and CPU for processing large volumes of data.

  • Ensure high-speed network connections between nodes to facilitate data transfer.

  • Implement data replication and backup strategies to prevent data loss.

  • Consider…
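
A back-of-the-envelope sizing sketch; every number is an illustrative assumption, not a recommendation:

    raw_tb = 100            # raw data volume
    replication = 3         # HDFS-style 3x replication
    headroom = 1.3          # ~30% spare for shuffle, temp files, and growth
    disk_per_node_tb = 20   # usable disk per worker node

    required_tb = raw_tb * replication * headroom   # 390 TB of raw disk
    nodes = -(-required_tb // disk_per_node_tb)     # ceiling division -> 20 nodes
    print(required_tb, int(nodes))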
