Data Engineer

1000+ Data Engineer Interview Questions and Answers

Updated 3 Mar 2025

Q101. How do you remove duplicate rows in BigQuery? How do you find the month of a given date in BigQuery?

Ans.

To remove duplicate rows in a BigQuery query result, use the DISTINCT keyword (to de-duplicate the table itself, recreate it from a DISTINCT query). To find the month of a given date, use the EXTRACT function.

  • To remove duplicate rows, use SELECT DISTINCT * FROM table_name;

  • To find the month of a given date, use SELECT EXTRACT(MONTH FROM date_column) AS month FROM table_name; EXTRACT returns the month as an integer from 1 to 12

  • Make sure to replace 'table_name' and 'date_column' with the appropriate values in your query.
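
Below is a hedged, runnable sketch of both queries submitted through the google-cloud-bigquery Python client; the table `my_project.my_dataset.orders` and its `order_date` column are hypothetical placeholders.

```python
# Minimal sketch: assumes google-cloud-bigquery is installed and default
# credentials are configured; table and column names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

# De-duplicated result set (submit with client.query(dedup_sql) the same way)
dedup_sql = "SELECT DISTINCT * FROM `my_project.my_dataset.orders`"

month_sql = """
    SELECT order_date,
           EXTRACT(MONTH FROM order_date) AS month  -- integer 1-12
    FROM `my_project.my_dataset.orders`
"""

for row in client.query(month_sql).result():  # blocks until the job finishes
    print(row.order_date, row.month)
```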

Q102. Write SQL code to get the city1-city2 distance from a table in which city1/city2 pairs can repeat

Ans.

SQL code to get the city1-city2 distance from a table in which city pairs can repeat

  • Use a self join or GROUP BY on the table to match city1/city2 pairs

  • Treat reversed pairs as the same pair by normalizing their order, e.g., with LEAST(city1, city2) and GREATEST(city1, city2)

  • Calculate the distance between the cities with an appropriate formula (e.g., haversine on coordinates), using a subquery if needed

Q103. What is the difference between a logical plan and a physical plan in PySpark?

Ans.

Logical plan represents the high-level abstract representation of the computation to be performed, while physical plan represents the actual execution plan with specific optimizations and details.

  • Logical plan is a high-level abstract representation of the computation to be performed.

  • Physical plan is the actual execution plan with specific optimizations and details.

  • Logical plan is created first and then optimized to generate the physical plan.

  • Physical plan includes details like… (see the sketch below)
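
A small PySpark sketch that prints both plans; explain(True) shows the parsed, analyzed, and optimized logical plans followed by the physical plan that Catalyst generates.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("plans-demo").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

query = df.filter(df.id > 1).select("val")
query.explain(True)  # logical plans printed first, physical plan last
```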

Q104. What are Python DataFrames, and how and where do you use them in a project?

Ans.

Python DataFrames are used to organize and manipulate data in a tabular format; a short example follows the list below.

  • Dataframes are created using the pandas library in Python.

  • They allow for easy manipulation of data, such as filtering, sorting, and grouping.

  • Dataframes can be used in various projects, such as data analysis, machine learning, and data visualization.

  • Examples of using dataframes include analyzing sales data, predicting customer behavior, and visualizing stock market trends.
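
A self-contained pandas example of the filtering, sorting, and grouping operations mentioned above, using made-up sales data.

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "sales":  [100, 250, 175, 90],
})

high_sales = df[df["sales"] > 100]                 # filtering
ranked = df.sort_values("sales", ascending=False)  # sorting
by_region = df.groupby("region")["sales"].sum()    # grouping/aggregation
print(by_region)
```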

Q105. What is AFD? Build a dynamic pipeline: Spark architecture, SQL data flow

Ans.

AFD is not a commonly used term in data engineering; the interviewer likely means ADF (Azure Data Factory), so asking for more context is a reasonable response.

    Q106. How do you create calculated columns in pandas?

    Ans.

    Calculated columns in pandas can be created using the assign() method or by directly assigning a new column based on existing columns.

    • Use the assign() method to create calculated columns by specifying the new column name and the calculation formula.

    • Directly assign a new column by referencing existing columns and applying the desired calculation.

    • Example: df.assign(new_column = df['column1'] + df['column2'])

    • Example: df['new_column'] = df['column1'] * 2

    Q107. What is a normal distribution in a dataset?

    Ans.

    A normal distribution in a dataset is a bell-shaped curve where the data is symmetrically distributed around the mean.

    • Data is symmetrically distributed around the mean

    • 68% of data falls within one standard deviation of the mean

    • 95% of data falls within two standard deviations of the mean

    • Examples: heights of people, test scores in a class
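
A quick numpy check of the 68/95 rule on simulated, normally distributed heights (the numbers are illustrative only).

```python
import numpy as np

rng = np.random.default_rng(seed=0)
heights = rng.normal(loc=170, scale=10, size=100_000)  # mean 170 cm, sd 10 cm

mu, sigma = heights.mean(), heights.std()
print(np.mean(np.abs(heights - mu) <= sigma))      # ~0.68
print(np.mean(np.abs(heights - mu) <= 2 * sigma))  # ~0.95
```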

    Q108. Write a Python program using list/dictionary comprehensions

    Ans.

    Python program for list/dictionary comprehensions

    • List comprehension: [expression for item in iterable]

    • Dictionary comprehension: {key_expression: value_expression for item in iterable}

    • Example: squares = [x**2 for x in range(10)]

    • Example: dict_squares = {x: x**2 for x in range(10)}

    Q109. Rate yourself out of 5 in PySpark, Python and SQL

    Ans.

    I would rate myself 4 in PySpark, 5 in Python, and 4 in SQL.

    • Strong proficiency in Python programming language

    • Experience in working with PySpark for big data processing

    • Proficient in writing complex SQL queries for data manipulation

    • Familiarity with optimizing queries for performance

    • Hands-on experience in data engineering projects

    Q110. Write a SQL query to get the names of students who got marks > 45 in every subject from the Student table

    Ans.

    SQL query to retrieve student names with marks > 45 in every subject (runnable sketch below)

    • Group by student and use HAVING MIN(marks) > 45, so only students above 45 in all subjects remain

    • If marks are stored in a separate Marks table, join it to Student on student_id first

    • Select the student names from the grouped result
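
A runnable sketch with sqlite3, assuming a single Student table holding one row per (name, subject, marks); the HAVING MIN(marks) > 45 condition keeps only students above 45 in every subject.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE student (name TEXT, subject TEXT, marks INTEGER);
    INSERT INTO student VALUES
        ('Asha', 'Math', 80), ('Asha', 'Science', 72),
        ('Ravi', 'Math', 90), ('Ravi', 'Science', 40);
""")

# marks > 45 in EVERY subject <=> the student's lowest mark exceeds 45
rows = con.execute("""
    SELECT name
    FROM student
    GROUP BY name
    HAVING MIN(marks) > 45
""").fetchall()
print(rows)  # [('Asha',)]
```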

    Q111. Provide an approach to develop an ETL pipeline where a CSV file is dropped in S3, transformed through Airflow, and loaded into Snowflake.

    Ans.

    Develop ETL pipeline using Airflow to process CSV files in S3 and load data into Snowflake.

    • Set up an S3 sensor in Airflow to detect when a new CSV file is dropped in the specified bucket.

    • Create a custom Python operator in Airflow to read the CSV file from S3, perform necessary transformations, and load data into Snowflake.

    • Use SnowflakeHook in Airflow to establish connection with Snowflake and execute SQL queries to load data.

    • Schedule the ETL pipeline in Airflow to run at spec…
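
A hedged sketch of such a DAG; the bucket, stage, table, and connection ids are hypothetical placeholders, and the import paths assume the Amazon and Snowflake provider packages are installed.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

with DAG(
    dag_id="csv_s3_to_snowflake",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",   # Airflow 2.4+ style schedule argument
    catchup=False,
) as dag:
    wait_for_csv = S3KeySensor(
        task_id="wait_for_csv",
        bucket_name="landing-bucket",     # hypothetical bucket
        bucket_key="incoming/*.csv",
        wildcard_match=True,
    )
    load_to_snowflake = SnowflakeOperator(
        task_id="load_to_snowflake",
        snowflake_conn_id="snowflake_default",
        sql="COPY INTO raw.orders FROM @s3_stage/incoming FILE_FORMAT = (TYPE = CSV);",
    )
    wait_for_csv >> load_to_snowflake
```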

    Q112. Calculate the second highest salary using SQL as well as PySpark.

    Ans.

    Calculate the second highest salary using SQL and PySpark (sketch below)

    • In SQL, use ORDER BY salary DESC with LIMIT/OFFSET, or MAX(salary) with a subquery that excludes the top salary

    • In PySpark, use a window function such as dense_rank() over salary in descending order and keep rank 2
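
A sketch of both approaches on toy data; the SQL variant uses a scalar subquery to skip the top salary, and the DataFrame variant uses dense_rank().

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("second-salary").getOrCreate()
df = spark.createDataFrame(
    [("a", 100), ("b", 300), ("c", 200)], ["emp", "salary"]
)
df.createOrReplaceTempView("employees")

# SQL: the highest salary below the overall maximum
spark.sql("""
    SELECT MAX(salary) AS second_highest
    FROM employees
    WHERE salary < (SELECT MAX(salary) FROM employees)
""").show()

# PySpark: rank salaries and keep rank 2
w = Window.orderBy(F.col("salary").desc())
df.withColumn("rnk", F.dense_rank().over(w)).filter("rnk = 2").show()
```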

    Q113. How do you optimize SQL queries?

    Ans.

    Optimizing SQL queries involves using indexes, avoiding unnecessary joins, and optimizing the query structure.

    • Use indexes on columns frequently used in WHERE clauses

    • Avoid using SELECT * and only retrieve necessary columns

    • Optimize joins by using INNER JOIN instead of OUTER JOIN when possible

    • Use EXPLAIN to analyze query performance and make necessary adjustments

    Q114. Find the second max salary of employees by department using a Python pandas DataFrame

    Ans.

    Find the second max salary of employees by department using a pandas DataFrame (sketch below)

    • Group the dataframe by department

    • Sort the salaries in descending order

    • Select the second highest salary for each department
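
A compact pandas version of those steps on toy data; after a descending sort, nth(1) takes the second row of each group (drop duplicate salaries first if ties should be ignored).

```python
import pandas as pd

df = pd.DataFrame({
    "dept":   ["IT", "IT", "IT", "HR", "HR"],
    "salary": [900, 700, 800, 500, 600],
})

second_max = (
    df.sort_values("salary", ascending=False)
      .groupby("dept")["salary"]
      .nth(1)   # 0-based: position 1 is the second-highest per department
)
print(second_max)
```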

    Q115. What is ETL, and can you provide examples from your project to illustrate its application?

    Ans.

    ETL stands for Extract, Transform, Load. It is a process used to extract data from various sources, transform it into a consistent format, and load it into a target database.

    • Extract: Retrieving data from different sources such as databases, APIs, or files.

    • Transform: Cleaning, filtering, and structuring the extracted data to fit the target database schema.

    • Load: Loading the transformed data into the target database for analysis or reporting.

    • Example: Extracting customer information…

    Q116. What is the difference between RDD (Resilient Distributed Datasets) and DataFrame in Apache Spark?

    Ans.

    RDD is a low-level abstraction representing a distributed collection of objects, while DataFrame is a higher-level abstraction representing a distributed collection of data organized into named columns.

    • RDD is more suitable for unstructured data and low-level transformations, while DataFrame is more suitable for structured data and high-level abstractions.

    • DataFrames provide optimizations like query optimization and code generation, while RDDs do not.

    • DataFrames support SQL queries…
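
The same aggregation expressed both ways, as a sketch of the difference in abstraction level.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("rdd-vs-df").getOrCreate()

# RDD: low-level tuples and explicit functions, no schema
rdd = spark.sparkContext.parallelize([("a", 1), ("b", 1), ("a", 1)])
print(sorted(rdd.reduceByKey(lambda x, y: x + y).collect()))

# DataFrame: named columns and a declarative API that Catalyst can optimize
df = spark.createDataFrame([("a",), ("b",), ("a",)], ["word"])
df.groupBy("word").agg(F.count("*").alias("cnt")).show()
```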

    Q117. What is the difference between the reduceBy and groupBy transformations in Apache Spark?

    Ans.

    reduceByKey is used to aggregate values for each key, while groupByKey is used to group values by key.

    • reduceByKey is a transformation that merges the values for each key using an associative and commutative function.

    • groupByKey is a transformation that groups the data based on a key and returns a grouped dataset.

    • reduceByKey is more efficient for aggregating data because it combines values map-side before shuffling, while groupByKey shuffles all the values before grouping.

    • reduceByKey is typically used…
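
A small sketch contrasting the two on identical data; these are the pair-RDD operations reduceByKey and groupByKey.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reduce-vs-group").getOrCreate()
pairs = spark.sparkContext.parallelize([("a", 1), ("a", 2), ("b", 3)])

# reduceByKey combines values map-side before the shuffle
print(sorted(pairs.reduceByKey(lambda x, y: x + y).collect()))  # [('a', 3), ('b', 3)]

# groupByKey ships every value across the network, then groups
print(sorted(pairs.groupByKey().mapValues(list).collect()))     # [('a', [1, 2]), ('b', [3])]
```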

    Q118. What is the difference between UNION and UNION ALL?

    Ans.

    UNION combines and removes duplicates, UNION ALL combines without removing duplicates.

    • UNION merges the results of two or more SELECT statements into a single result set.

    • UNION ALL does the same as UNION, but it does not remove duplicates.

    • UNION is more resource-intensive than UNION ALL because it must de-duplicate the result (typically via a sort or hash operation).

    • UNION ALL is faster than UNION when all records are required.

    • Example: SELECT column1 FROM table1 UNION SELECT column1 FROM table2;

    • Example: SELECT column1 FROM table1 UNION ALL SELECT column1 FROM table2;

    Q119. 3 SQL queries; architecture for a data pipeline to display hours per activity in the last 24 hrs on a smartwatch.

    Ans.

    Design a data pipeline to display hourly activity on a smartwatch for the last 24 hours using SQL queries.

    • Create a table to store activity data with columns for timestamp and activity type.

    • Use a SQL query to aggregate activity data by hour for the last 24 hours.

    • Display the results on the smartwatch using a suitable visualization.

    Q120. How to deal with data quality issues

    Ans.

    Data quality issues can be dealt with by identifying the root cause, implementing data validation checks, and establishing data governance policies.

    • Identify the root cause of the data quality issue

    • Implement data validation checks to prevent future issues

    • Establish data governance policies to ensure data accuracy and consistency

    • Regularly monitor and audit data quality

    • Involve stakeholders in the data quality process

    • Use data profiling and cleansing tools

    • Ensure data security and privacy

    Q121. ETL - How to do full load in SSIS, mention the steps

    Ans.

    To perform a full load in SSIS, you can use the Data Flow Task with a source and destination component.

    • Create a Data Flow Task in the Control Flow tab of the SSIS package.

    • Add a source component to extract data from the source system.

    • Add a destination component to load data into the destination system.

    • Map the columns from the source to the destination.

    • Run the package to execute the full load.

    Q122. ETL- how to do the incremental load in ADF and in SSIS

    Ans.

    Incremental load in ADF and SSIS involves identifying new or updated data and loading only those changes.

    • In ADF, use watermark columns to track the last loaded value and filter data based on this value

    • In SSIS, use CDC (Change Data Capture) components or custom scripts to identify new or updated data

    • Both ADF and SSIS support incremental loading by comparing source and target data to determine changes

    Q123. How do you copy data without using multiple activities, dynamically using loops/parameterization?

    Ans.

    Use a single activity with dynamic parameterization and loops to copy data.

    • Use a loop to iterate through the data source and destination locations.

    • Parameterize the source and destination locations to dynamically copy data.

    • Utilize a scripting language like Python or PowerShell to implement the logic.

    • Example: Use a Python script with a loop to copy files from one folder to another.

    • Example: Use PowerShell script with dynamic parameters to copy data from one database to another.
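
A minimal Python sketch of the loop-plus-parameters idea; the source/destination pairs are hypothetical.

```python
import shutil
from pathlib import Path

# One copy routine, driven entirely by parameters (hypothetical paths)
copy_jobs = [
    ("/data/raw/2024", "/backup/2024"),
    ("/data/raw/2025", "/backup/2025"),
]

for src, dst in copy_jobs:
    Path(dst).mkdir(parents=True, exist_ok=True)
    for f in Path(src).glob("*.csv"):
        shutil.copy2(f, dst)  # copies the file along with its metadata
```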

    Q124. How do you insert non-duplicate data into a target table? How many ways can it be done?

    Ans.

    To insert non-duplicate data into a target table, you can use methods like using a unique constraint, using a merge statement, or using a temporary table.

    • Use a unique constraint on the target table to prevent duplicate entries.

    • Use a merge statement to insert data into the target table only if it does not already exist.

    • Use a temporary table to store the new data, then insert only the non-duplicate records into the target table.
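
Two of those options shown with sqlite3 on toy tables; INSERT OR IGNORE is SQLite's shortcut for letting a unique constraint reject duplicates (other engines use MERGE or INSERT ... ON CONFLICT).

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE target (id INTEGER PRIMARY KEY, val TEXT);
    CREATE TABLE staging (id INTEGER, val TEXT);
    INSERT INTO target  VALUES (1, 'a');
    INSERT INTO staging VALUES (1, 'a'), (2, 'b');
""")

# 1) Let the unique constraint silently reject duplicates
con.execute("INSERT OR IGNORE INTO target SELECT * FROM staging")

# 2) Insert only rows that do not already exist
con.execute("""
    INSERT INTO target
    SELECT s.id, s.val FROM staging s
    WHERE NOT EXISTS (SELECT 1 FROM target t WHERE t.id = s.id)
""")
print(con.execute("SELECT * FROM target ORDER BY id").fetchall())  # [(1, 'a'), (2, 'b')]
```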

    Q125. Have you worked on Lambda functions? Explain.

    Ans.

    Lambda function is a serverless computing service that runs code in response to events and automatically manages the computing resources required.

    • Lambda functions are event-driven and can be triggered by various AWS services such as S3, DynamoDB, API Gateway, etc.

    • They are written in languages like Python, Node.js, Java, etc.

    • Lambda functions are scalable and cost-effective as you only pay for the compute time you consume.

    • They can be used for data processing, real-time file processing…
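
A hedged sketch of an S3-triggered handler; the event fields follow the standard S3 notification shape, and the processing itself is left as a placeholder.

```python
import json

def lambda_handler(event, context):
    # Each record describes one object that triggered the function
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print(f"processing s3://{bucket}/{key}")  # placeholder for real work
    return {"statusCode": 200, "body": json.dumps("ok")}
```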

    Q126. Explain Databricks DLT, and when will you use batch vs streaming?

    Ans.

    Databricks DLT (Delta Live Tables) is a declarative framework for building reliable batch and streaming pipelines.

    • DLT stands for Delta Live Tables, a Databricks framework for declarative ETL pipelines; it runs on top of Delta Lake, the storage layer that brings ACID transactions to Apache Spark workloads.

    • Batch processing is used when data is collected over a period of time and processed in large chunks, while streaming processing is used for real-time data processing.

    • Use batch processing for historical data analysis, ETL jobs, and periodic reporting. Use streaming processing…

    Q127. How to split staged data’s row into separate columns

    Ans.

    Use SQL functions like SUBSTRING and CHARINDEX to split staged data's row into separate columns

    • Use SUBSTRING function to extract specific parts of the row

    • Use CHARINDEX function to find the position of a specific character in the row

    • Use CASE statements to create separate columns based on conditions

    Q128. What is the difference between repartition and coalesce?

    Ans.

    Repartition changes (usually increases) the number of partitions in a DataFrame with a full shuffle, while coalesce reduces the number of partitions without shuffling data.

    • Repartition involves a full shuffle of the data across the cluster, which can be expensive.

    • Coalesce minimizes data movement by only creating new partitions if necessary.

    • Repartition is typically used when increasing parallelism or evenly distributing data, while coalesce is used for reducing the number of partitions without a full shuffle.

    • Example: see the sketch below
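
A quick sketch of the difference; the partition counts before and after are the observable effect.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitions").getOrCreate()
df = spark.range(1_000_000)

print(df.rdd.getNumPartitions())                 # current partition count
print(df.repartition(8).rdd.getNumPartitions())  # 8, via a full shuffle
print(df.coalesce(2).rdd.getNumPartitions())     # 2, by merging partitions
```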

    Q129. How to connect SQL Server to Databricks

    Ans.

    To connect SQL server to Databricks, use JDBC/ODBC drivers and configure the connection settings.

    • Install the appropriate JDBC/ODBC driver for SQL server

    • Configure the connection settings in Databricks

    • Use the JDBC/ODBC driver to establish the connection
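
A hedged sketch of the JDBC read as run from a Databricks notebook (where spark is predefined); the host, database, table, and credentials are placeholders, and the SQL Server JDBC driver must be available on the cluster.

```python
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://myhost:1433;databaseName=sales")  # placeholder
    .option("dbtable", "dbo.orders")
    .option("user", "etl_user")
    .option("password", "****")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .load()
)
df.show(5)
```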

    Q130. Difference between a view and an indexed view; pipeline design process; where to monitor pipeline failures and how to fix them

    Ans.

    View vs indexed view; the pipeline design process; monitoring pipeline failures and fixing them

    • View is a virtual table based on a SELECT query, while indexed view is a view with a clustered index for faster retrieval

    • Pipeline design process involves defining data sources, transformations, and destinations

    • Monitor pipeline failures through the orchestrator's UI, logs and alerts (e.g., Airflow task logs, or Azure Monitor for ADF pipelines)

    • Fix pipeline failures by identifying the root cause, adjusting configurations, or updating dependencies

    Q131. What are types of structured query languages

    Ans.

    Types of structured query languages include SQL, PL/SQL, T-SQL, and others.

    • SQL (Structured Query Language) - widely used for managing relational databases

    • PL/SQL (Procedural Language/SQL) - Oracle's proprietary extension for SQL

    • T-SQL (Transact-SQL) - Microsoft's extension for SQL used in SQL Server

    • Other dialects are tied to specific database systems, e.g., MySQL, PostgreSQL, SQLite, etc.

    Q132. What are some SQL queries that utilize joins and window functions?

    Ans.

    SQL queries using joins and window functions

    • Use INNER JOIN to combine rows from two or more tables based on a related column

    • Use WINDOW functions like ROW_NUMBER() to assign a unique sequential integer to each row within a partition

    • Example: SELECT column1, column2, ROW_NUMBER() OVER(PARTITION BY column1 ORDER BY column2) AS row_num FROM table_name
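
A runnable version of that pattern on toy data: an INNER JOIN plus ROW_NUMBER() to keep each customer's most recent order.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-window").getOrCreate()
spark.createDataFrame([(1, "Ana"), (2, "Raj")], ["id", "name"]) \
     .createOrReplaceTempView("customers")
spark.createDataFrame(
    [(1, "2024-01-01"), (1, "2024-02-01"), (2, "2024-01-15")],
    ["cust_id", "order_date"],
).createOrReplaceTempView("orders")

spark.sql("""
    SELECT name, order_date
    FROM (
        SELECT c.name, o.order_date,
               ROW_NUMBER() OVER (PARTITION BY c.id ORDER BY o.order_date DESC) AS rn
        FROM customers c
        INNER JOIN orders o ON o.cust_id = c.id
    ) t
    WHERE t.rn = 1
""").show()
```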

    Q133. Are you familiar with Celebal Technologies?

    Ans.

    Celebal Technologies is a technology company specializing in data engineering and analytics solutions.

    • Celebal Technologies is known for providing data engineering and analytics solutions.

    • They offer services such as data integration, data warehousing, and data visualization.

    • Celebal Technologies works with clients across various industries to help them optimize their data processes.

    • They have expertise in technologies like Hadoop, Spark, and Python for data engineering.

    • The company…

    Q134. What are the technologies you have worked on?

    Ans.

    I have worked on various technologies including Hadoop, Spark, SQL, Python, and AWS.

    • Experience with Hadoop and Spark for big data processing

    • Proficient in SQL for data querying and manipulation

    • Skilled in Python for data analysis and scripting

    • Familiarity with AWS services such as S3, EC2, and EMR

    • Knowledge of data warehousing and ETL processes

    Q135. How do you remove duplicates in a DataFrame using PySpark?

    Ans.

    Use the dropDuplicates() function in PySpark to remove duplicates in a DataFrame.

    • Use dropDuplicates() function on the data frame to remove duplicates based on all columns.

    • Specify subset of columns to remove duplicates based on specific columns.

    • Use the distinct() function to remove duplicates and keep only distinct rows.
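
A short sketch of all three variants named above.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedup").getOrCreate()
df = spark.createDataFrame([(1, "a"), (1, "a"), (1, "b")], ["id", "val"])

df.dropDuplicates().show()        # duplicates across all columns removed
df.dropDuplicates(["id"]).show()  # one row per id (an arbitrary survivor)
df.distinct().show()              # same effect as dropDuplicates()
```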

    Q136. How do you set up the alerting mechanism in ADF for failed pipelines?

    Ans.

    Alerting mechanism in ADF for failed pipelines involves setting up alerts in Azure Monitor and configuring email notifications.

    • Set up alerts in Azure Monitor for monitoring pipeline runs

    • Configure alert rules to trigger notifications when a pipeline run fails

    • Use Azure Logic Apps to send email notifications for failed pipeline runs

    Q137. What is the difference between a primary key and a unique key in a DBMS?

    Ans.

    Primary key uniquely identifies a record in a table, while unique key ensures all values in a column are distinct.

    • Primary key does not allow NULL values, while unique key allows NULLs (exactly one in SQL Server; several other databases allow multiple).

    • A table can have only one primary key, but multiple unique keys.

    • Primary key is a combination of unique and not null constraints.

    • Primary key is used to establish relationships between tables, while unique key is used to enforce uniqueness in a column.

    Q138. What operator is used in Composer to move data from GCS to BigQuery?

    Ans.

    The operator used in Composer to move data from GCS to BigQuery is GCSToBigQueryOperator.

    • The GCS to BigQuery operator is used in Apache Airflow, which is the underlying technology of Composer.

    • This operator allows you to transfer data from Google Cloud Storage (GCS) to BigQuery.

    • You can specify the source and destination parameters in the operator to define the data transfer process.
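
A hedged sketch of the operator as it would appear in a Composer DAG file; the bucket, object pattern, and destination table are placeholders, and the import path assumes the Google provider package.

```python
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)

load_events = GCSToBigQueryOperator(
    task_id="gcs_to_bq",
    bucket="my-landing-bucket",                              # placeholder
    source_objects=["events/2024/*.csv"],                    # placeholder
    destination_project_dataset_table="my_project.analytics.events",
    source_format="CSV",
    write_disposition="WRITE_APPEND",
)
```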

    Q139. How would you build a pipeline to connect an HTTP source and bring data into ADLS?

    Ans.

    Build a pipeline to connect an HTTP source and bring data into ADLS

    • Set up a data ingestion tool like Apache NiFi or Azure Data Factory to pull data from the http source

    • Transform the data as needed using tools like Apache Spark or Azure Databricks

    • Store the data in Azure Data Lake Storage (ADLS) for further processing and analysis

    Q140. How have you managed experiences involving strict timelines and deliverables?

    Ans.

    I have successfully managed experiences involving strict timelines and deliverables by prioritizing tasks, setting clear goals, and communicating effectively with team members.

    • Prioritizing tasks based on deadlines and importance

    • Setting clear goals and milestones to track progress

    • Communicating effectively with team members to ensure everyone is on the same page

    • Proactively identifying potential roadblocks and finding solutions to overcome them

    Q141. Python vs Java: advantages and disadvantages; Kubernetes

    Ans.

    Python is more flexible and easier to learn, while Java is more performant and better for large-scale projects. Kubernetes is a popular container orchestration tool.

    • Python is more concise and easier to read/write than Java

    • Java is more performant and better for large-scale projects

    • Kubernetes is a popular container orchestration tool used for managing containerized applications

    • Kubernetes provides features like automatic scaling, self-healing, and rolling updates

    • Python is often…

    Q142. What is the SQL query to find the third highest salary from a given table?

    Ans.

    Use SQL query with ORDER BY and LIMIT to find the third highest salary from a table.

    • Use ORDER BY clause to sort salaries in descending order

    • Use LIMIT 1 OFFSET 2 to skip the first two highest salaries

    • Example: SELECT salary FROM employees ORDER BY salary DESC LIMIT 1 OFFSET 2

    Q143. How will you design/configure a cluster if you are given 10 petabytes of data?

    Ans.

    Designing/configuring a cluster for 10 petabytes of data involves considerations for storage capacity, processing power, network bandwidth, and fault tolerance.

    • Consider using a distributed file system like HDFS or object storage like Amazon S3 to store and manage the large volume of data.

    • Implement a scalable processing framework like Apache Spark or Hadoop to efficiently process and analyze the data in parallel.

    • Utilize a cluster management system like Apache Mesos or Kubernetes…

    Q144. How will you run a child notebook from a parent notebook using a dbutils command?

    Ans.

    Use dbutils.notebook.run() command to run a child notebook in a parent notebook

    • Use dbutils.notebook.run() command with the path to the child notebook and any parameters needed

    • Ensure that the child notebook is accessible and has necessary permissions

    • Handle any return values or errors from the child notebook appropriately
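
A one-call sketch as it would look inside a Databricks parent notebook (dbutils is injected there, not importable locally); the path and arguments are placeholders.

```python
# dbutils.notebook.run(path, timeout_seconds, arguments)
result = dbutils.notebook.run(
    "/Repos/etl/child_notebook",   # placeholder path to the child notebook
    600,                           # timeout in seconds
    {"run_date": "2024-01-01"},    # parameters read via widgets in the child
)
print(result)  # whatever the child passed to dbutils.notebook.exit(...)
```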

    Q145. What would the Spark configuration be to process 2 GB of data?

    Ans.

    Set spark configuration with appropriate memory and cores for efficient processing of 2 GB data

    • Increase executor memory and cores to handle larger data size

    • Adjust spark memory overhead to prevent out of memory errors

    • Optimize shuffle partitions for better performance
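
A hedged starting point; 2 GB is small, so modest executors suffice and the exact values depend on the cluster and workload.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("small-batch")
    .config("spark.executor.memory", "2g")         # headroom over the input size
    .config("spark.executor.cores", "2")
    .config("spark.sql.shuffle.partitions", "16")  # down from the default 200
    .getOrCreate()
)
```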

    Q146. What is the difference between Hadoop and Spark? What is the difference between coalesce and repartition? SQL query; HDFS

    Ans.

    Hadoop is a distributed storage and processing framework, while Spark is a fast and general-purpose cluster computing system.

    • Hadoop is primarily used for batch processing of large datasets, while Spark is known for its in-memory processing capabilities.

    • Hadoop uses MapReduce for processing data, while Spark uses Resilient Distributed Datasets (RDDs).

    • Coalesce is used to reduce the number of partitions in a DataFrame or RDD without shuffling data, while repartition is used to increase…

    Q147. Assume below Dataframes DF1 (UserID,Name) DF2 (UserID,PageID,Timestamp,Events) Write code to Join the DF's, Count the No of Events and filter Users with 0 Events

    Ans.

    Join DF's, count events, filter users with 0 events

    • Use a left join to combine DF1 and DF2 on UserID so that users without events are kept

    • Group by UserID and count the number of events

    • Keep the users whose event count is 0 (see the sketch below)
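
A sketch of those steps on toy data; the LEFT join is what lets zero-event users survive, and counting the event-side column ignores its NULLs.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("user-events").getOrCreate()
df1 = spark.createDataFrame([(1, "Ana"), (2, "Raj")], ["UserID", "Name"])
df2 = spark.createDataFrame(
    [(1, 10, 1700000000, 5)], ["UserID", "PageID", "Timestamp", "Events"]
)

counts = (
    df1.join(df2, on="UserID", how="left")
       .groupBy("UserID", "Name")
       .agg(F.count("PageID").alias("event_count"))  # count() skips NULLs
)
counts.show()                            # Raj has event_count = 0
counts.filter("event_count = 0").show()  # users with no events
```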

    Q148. What are the 2 types of modes for Spark architecture?

    Ans.

    The two types of modes for Spark architecture are standalone mode and cluster mode.

    • Standalone mode: Spark runs on a single machine with a single JVM and is suitable for development and testing.

    • Cluster mode: Spark runs on a cluster of machines managed by a cluster manager like YARN or Mesos for production workloads.

    Q149. What is Re-Partition and Coalesce? How are these used?

    Ans.

    Re-Partition and Coalesce are methods used to control the number of partitions in a dataset in Apache Spark.

    • Re-Partition is used to increase or decrease the number of partitions in a dataset by shuffling the data across the cluster.

    • Coalesce is used to decrease the number of partitions in a dataset without shuffling the data, which can improve performance.

    • Re-Partition is typically used when there is a need to increase parallelism or balance the data distribution, while Coalesce…

    Q150. How would you build a pipeline for a Machine learning project?

    Ans.

    To build a pipeline for a Machine learning project, you need to collect data, preprocess it, train the model, evaluate its performance, and deploy it.

    • Collect relevant data from various sources

    • Preprocess the data by cleaning, transforming, and normalizing it

    • Split the data into training and testing sets

    • Train the machine learning model using the training data

    • Evaluate the model's performance using the testing data

    • Fine-tune the model if necessary

    • Deploy the model into the production environment…
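
A minimal scikit-learn sketch of those stages on synthetic data; a real project would add data collection, feature engineering, and deployment steps around it.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=0)  # stand-in for real data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),      # preprocessing
    ("model", LogisticRegression()),  # training
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))     # evaluation before deployment
```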
