Data Engineer
1000+ Data Engineer Interview Questions and Answers
Q101. Write a Python program for list/dictionary comprehensions
Python program for list/dictionary comprehensions
List comprehension: [expression for item in iterable]
Dictionary comprehension: {key_expression: value_expression for item in iterable}
Example: squares = [x**2 for x in range(10)]
Example: dict_squares = {x: x**2 for x in range(10)}
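A minimal runnable sketch putting both forms together, with a filtering condition added (the values and names are illustrative):
# List comprehension with a condition: squares of even numbers only
even_squares = [x**2 for x in range(10) if x % 2 == 0]   # [0, 4, 16, 36, 64]

# Dictionary comprehension: map each word to its length
words = ["spark", "sql", "python"]
word_lengths = {w: len(w) for w in words}                 # {'spark': 5, 'sql': 3, 'python': 6}

print(even_squares)
print(word_lengths)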
Q102. Rate yourself out of 5 in PySpark, Python, and SQL
I would rate myself 4 in PySpark, 5 in Python, and 4 in SQL.
Strong proficiency in the Python programming language
Experience working with PySpark for big data processing
Proficient in writing complex SQL queries for data manipulation
Familiarity with optimizing queries for performance
Hands-on experience in data engineering projects
Q103. Write a SQL query to get student names who scored marks > 45 in each subject from the Student table
SQL query to retrieve student names with marks > 45 in each subject
Use GROUP BY and HAVING clauses to filter students with marks > 45 in each subject
Join Student table with Marks table on student_id to get marks for each student
Select student names from Student table based on the conditions
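A hedged sketch of one way to express this, assuming a Student(student_id, name) table and a separate Marks(student_id, subject, marks) table registered as Spark temp views; HAVING MIN(marks) > 45 enforces the "greater than 45 in every subject" condition:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Students qualify only if their lowest mark across all subjects is above 45
result = spark.sql("""
    SELECT s.name
    FROM Student s
    JOIN Marks m ON s.student_id = m.student_id
    GROUP BY s.student_id, s.name
    HAVING MIN(m.marks) > 45
""")
result.show()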
Q104. Provide an approach to develop an ETL pipeline where a CSV file is dropped in S3, transformed through Airflow, and loaded into Snowflake.
Develop ETL pipeline using Airflow to process CSV files in S3 and load data into Snowflake.
Set up an S3 sensor in Airflow to detect when a new CSV file is dropped in the specified bucket.
Create a custom Python operator in Airflow to read the CSV file from S3, perform necessary transformations, and load data into Snowflake.
Use SnowflakeHook in Airflow to establish connection with Snowflake and execute SQL queries to load data.
Schedule the ETL pipeline in Airflow to run at specified intervals, or let the S3 sensor trigger processing as new files arrive; a minimal DAG sketch follows.
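A minimal Airflow DAG sketch of the approach above; the bucket, key pattern, connection IDs, table, and stage names are placeholder assumptions, not a definitive implementation:
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook

def transform_and_load(**context):
    # Read the CSV from S3, apply transformations (omitted here), then load
    # into Snowflake via the hook; shown as a COPY INTO from a stage for brevity.
    hook = SnowflakeHook(snowflake_conn_id="snowflake_default")
    hook.run(
        "COPY INTO my_db.my_schema.target_table "
        "FROM @my_s3_stage FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)"
    )

with DAG(
    dag_id="s3_csv_to_snowflake",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    wait_for_file = S3KeySensor(
        task_id="wait_for_csv",
        bucket_name="my-landing-bucket",
        bucket_key="incoming/*.csv",
        wildcard_match=True,
        aws_conn_id="aws_default",
    )
    load = PythonOperator(task_id="transform_and_load", python_callable=transform_and_load)

    wait_for_file >> load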
Q105. Calculate second highest salary using SQL as well as pyspark.
Calculate second highest salary using SQL and pyspark
Use a SQL query that orders distinct salaries in descending order and skips the highest one (for example with a MAX subquery or OFFSET)
In PySpark, use orderBy() and take() on distinct salaries to achieve the same result, as sketched below
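A sketch in both flavours, assuming an employee table / DataFrame with a salary column:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 5000), (2, 7000), (3, 7000), (4, 6000)], ["emp_id", "salary"])
df.createOrReplaceTempView("employee")

# SQL: highest salary below the overall maximum
spark.sql("""
    SELECT MAX(salary) AS second_highest
    FROM employee
    WHERE salary < (SELECT MAX(salary) FROM employee)
""").show()

# PySpark DataFrame API: distinct salaries ordered descending, take the second
second_highest = df.select("salary").distinct().orderBy(F.col("salary").desc()).take(2)[1][0]
print(second_highest)   # 6000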
Q106. How do you optimize SQL queries?
Optimizing SQL queries involves using indexes, avoiding unnecessary joins, and optimizing the query structure.
Use indexes on columns frequently used in WHERE clauses
Avoid using SELECT * and only retrieve necessary columns
Optimize joins by using INNER JOIN instead of OUTER JOIN when possible
Use EXPLAIN to analyze query performance and make necessary adjustments
Q107. Find second max salary of employee by department using Python pandas dataframe
Find second max salary of employee by department using pandas dataframe
Group the dataframe by department
Sort the salaries in descending order
Select the second highest salary for each department
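A minimal pandas sketch of these steps; the department and salary column names are assumptions:
import pandas as pd

df = pd.DataFrame({
    "employee": ["a", "b", "c", "d", "e"],
    "department": ["HR", "HR", "IT", "IT", "IT"],
    "salary": [40000, 55000, 70000, 90000, 80000],
})

# Deduplicate salaries within a department, dense-rank them descending,
# and keep the row ranked 2 in each group
second_max = (
    df.drop_duplicates(subset=["department", "salary"])
      .assign(rank=lambda d: d.groupby("department")["salary"]
                              .rank(method="dense", ascending=False))
      .query("rank == 2")[["department", "salary"]]
)
print(second_max)   # HR -> 40000, IT -> 80000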
Q108. What is the difference between UNION and UNION ALL?
UNION combines and removes duplicates, UNION ALL combines without removing duplicates.
UNION merges the results of two or more SELECT statements into a single result set.
UNION ALL does the same as UNION, but it does not remove duplicates.
UNION is more resource-intensive than UNION ALL because it performs a duplicate-elimination step (typically a sort or hash).
UNION ALL is faster than UNION when all records are required.
Example: SELECT column1 FROM table1 UNION SELECT column1 FROM table2;
Example: SELECT column1 FROM table1 UNION ALL SELECT column1 FROM table2;
Q109. 3 SQL queries; architecture for a data pipeline to display hours per activity in the last 24 hrs on a smartwatch.
Design a data pipeline to display hourly activity on a smartwatch for the last 24 hours using SQL queries.
Create a table to store activity data with columns for timestamp and activity type.
Use a SQL query to aggregate activity data by hour for the last 24 hours.
Display the results on the smartwatch using a suitable visualization.
Q110. How to deal with data quality issues
Data quality issues can be dealt with by identifying the root cause, implementing data validation checks, and establishing data governance policies.
Identify the root cause of the data quality issue
Implement data validation checks to prevent future issues
Establish data governance policies to ensure data accuracy and consistency
Regularly monitor and audit data quality
Involve stakeholders in the data quality process
Use data profiling and cleansing tools
Ensure data security and privacy throughout the data quality process
Q111. What is ETL, and can you provide examples from your project to illustrate its application?
ETL stands for Extract, Transform, Load. It is a process used to extract data from various sources, transform it into a consistent format, and load it into a target database.
Extract: Retrieving data from different sources such as databases, APIs, or files.
Transform: Cleaning, filtering, and structuring the extracted data to fit the target database schema.
Load: Loading the transformed data into the target database for analysis or reporting.
Example: Extracting customer information from a CRM database, standardising and de-duplicating it, and loading it into a data warehouse for reporting.
Q112. What is the difference between RDD (Resilient Distributed Datasets) and DataFrame in Apache Spark?
RDD is a low-level abstraction representing a distributed collection of objects, while DataFrame is a higher-level abstraction representing a distributed collection of data organized into named columns.
RDD is more suitable for unstructured data and low-level transformations, while DataFrame is more suitable for structured data and high-level abstractions.
DataFrames provide optimizations like query optimization and code generation, while RDDs do not.
DataFrames support SQL queries through Spark SQL and benefit from the Catalyst optimizer, while RDDs do not.
Q113. What is the difference between the reduceBy and groupBy transformations in Apache Spark?
reduceByKey aggregates the values of each key, while groupByKey only groups the values of each key (these are the actual Spark names behind "reduceBy" and "groupBy").
reduceByKey is a transformation that combines the values of each key using an associative, commutative function.
groupByKey is a transformation that collects all values for a key into a single grouped dataset without aggregating them.
reduceByKey is more efficient for aggregating data because it combines values within each partition before shuffling, while groupByKey shuffles all the values before grouping.
reduceByKey is typically used for sums, counts, and similar aggregations; groupByKey only when every individual value per key is genuinely needed.
Q114. ETL - How to do full load in SSIS, mention the steps
To perform a full load in SSIS, you can use the Data Flow Task with a source and destination component.
Create a Data Flow Task in the Control Flow tab of the SSIS package.
Add a source component to extract data from the source system.
Add a destination component to load data into the destination system.
Map the columns from the source to the destination.
Run the package to execute the full load.
Q115. ETL- how to do the incremental load in ADF and in SSIS
Incremental load in ADF and SSIS involves identifying new or updated data and loading only those changes.
In ADF, use watermark columns to track the last loaded value and filter data based on this value
In SSIS, use CDC (Change Data Capture) components or custom scripts to identify new or updated data
Both ADF and SSIS support incremental loading by comparing source and target data to determine changes
Q116. How to copy data without using multiple activities, dynamically using loops/parameterization?
Use a single parameterized copy activity driven by a loop instead of many separate activities.
In ADF, place one Copy activity inside a ForEach loop that iterates over a list of source/destination pairs.
Parameterize the source and destination datasets (paths, file names, table names) so the same activity copies different data on each iteration.
Utilize a scripting language like Python or PowerShell to implement the logic.
Example: Use a Python script with a loop to copy files from one folder to another.
Example: Use PowerShell script with dynamic parameters to copy data from one database to another.
Q117. How to insert non-duplicate data into a target table? How many ways can we do it?
To insert non-duplicate data into a target table, you can use methods like using a unique constraint, using a merge statement, or using a temporary table.
Use a unique constraint on the target table to prevent duplicate entries.
Use a merge statement to insert data into the target table only if it does not already exist.
Use a temporary table to store the new data, then insert only the non-duplicate records into the target table.
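Two of these approaches sketched as SQL run through PySpark, assuming staging and target tables that share an id key (a MERGE like this works on Delta Lake tables and most warehouses):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 1) Insert only rows whose key does not already exist in the target
spark.sql("""
    INSERT INTO target
    SELECT s.* FROM staging s
    WHERE NOT EXISTS (SELECT 1 FROM target t WHERE t.id = s.id)
""")

# 2) MERGE: insert only when no matching key is found
spark.sql("""
    MERGE INTO target t
    USING staging s
    ON t.id = s.id
    WHEN NOT MATCHED THEN INSERT *
""")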
Q118. Have you worked on Lambda functions? Explain.
Lambda function is a serverless computing service that runs code in response to events and automatically manages the computing resources required.
Lambda functions are event-driven and can be triggered by various AWS services such as S3, DynamoDB, API Gateway, etc.
They are written in languages like Python, Node.js, Java, etc.
Lambda functions are scalable and cost-effective as you only pay for the compute time you consume.
They can be used for data processing, real-time file processing, and lightweight backend services.
Q119. Explain databricks dlt, and when will you use batch vs streaming?
Databricks DLT (Delta Live Tables) is a declarative framework for building reliable batch and streaming data pipelines on top of Delta Lake.
With DLT you define tables/views and data quality expectations, and the service manages orchestration, retries, and monitoring of the pipeline.
Batch processing is used when data is collected over a period of time and processed in large chunks, while streaming processing handles data continuously as it arrives.
Use batch processing for historical data analysis, ETL jobs, and periodic reporting; use streaming processing for real-time dashboards, alerting, and other low-latency workloads.
Q120. How to split staged data’s row into separate columns
Use SQL functions like SUBSTRING and CHARINDEX to split staged data's row into separate columns
Use SUBSTRING function to extract specific parts of the row
Use CHARINDEX function to find the position of a specific character in the row
Use CASE statements to create separate columns based on conditions
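A PySpark alternative sketch for the same task: splitting one delimited column into named columns (the delimiter and column names are assumptions):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
raw = spark.createDataFrame([("101,Alice,Sales",), ("102,Bob,IT",)], ["line"])

# Split the raw line on commas and promote the pieces to named columns
parts = F.split(F.col("line"), ",")
df = raw.select(
    parts.getItem(0).alias("id"),
    parts.getItem(1).alias("name"),
    parts.getItem(2).alias("department"),
)
df.show()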
Q121. What is the difference between repartition and coalesce?
Repartition increases the number of partitions in a DataFrame, while coalesce reduces the number of partitions without shuffling data.
Repartition involves a full shuffle of the data across the cluster, which can be expensive.
Coalesce minimizes data movement by merging existing partitions instead of performing a full shuffle.
Repartition is typically used when increasing parallelism or evenly distributing data, while coalesce is used for reducing the number of partitions without a full shuffle.
Example: df.repartition(200) to increase parallelism before a heavy join; df.coalesce(10) to cut the number of output files before writing.
Q122. How to connect SQL Server to Databricks
To connect SQL server to Databricks, use JDBC/ODBC drivers and configure the connection settings.
Install the appropriate JDBC/ODBC driver for SQL server
Configure the connection settings in Databricks
Use the JDBC/ODBC driver to establish the connection
Q123. Difference between view and indexed view, pipeline design process, where to monitor pipeline failures and how to fix them
View vs indexed view, pipeline design process, monitoring pipeline failures and fixing
View is a virtual table based on a SELECT query, while indexed view is a view with a clustered index for faster retrieval
Pipeline design process involves defining data sources, transformations, and destinations
Monitor pipeline failures through the monitoring and logging built into orchestration tools such as Apache Airflow or NiFi, or a platform monitor like Azure Monitor
Fix pipeline failures by identifying the root cause, adjusting configurations, or updating dependencies
Q124. What are types of structured query languages
Types of structured query languages include SQL, PL/SQL, T-SQL, and others.
SQL (Structured Query Language) - widely used for managing relational databases
PL/SQL (Procedural Language/SQL) - Oracle's proprietary extension for SQL
T-SQL (Transact-SQL) - Microsoft's extension for SQL used in SQL Server
Others - SQL dialects used by database systems such as MySQL, PostgreSQL, and SQLite
Q125. Explain briefly how GCP is highly scalable and manages data.
GCP offers scalable data solutions through managed services, allowing seamless data storage, processing, and analytics.
Auto-scaling: GCP services like Google Kubernetes Engine automatically adjust resources based on demand.
BigQuery: A serverless data warehouse that scales to handle petabytes of data without manual intervention.
Cloud Storage: Provides scalable object storage with high availability and durability, suitable for unstructured data.
Dataflow: A fully managed service for batch and streaming data processing that autoscales workers with the workload.
Q126. What are some SQL queries that utilize joins and window functions?
SQL queries using joins and window functions
Use INNER JOIN to combine rows from two or more tables based on a related column
Use WINDOW functions like ROW_NUMBER() to assign a unique sequential integer to each row within a partition
Example: SELECT column1, column2, ROW_NUMBER() OVER(PARTITION BY column1 ORDER BY column2) AS row_num FROM table_name
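A slightly fuller sketch combining a join with a window function, e.g. picking each customer's latest order; the customers/orders tables and column names are assumptions:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("""
    SELECT customer_name, order_id, order_date
    FROM (
        SELECT c.customer_name,
               o.order_id,
               o.order_date,
               ROW_NUMBER() OVER (PARTITION BY c.customer_id
                                  ORDER BY o.order_date DESC) AS rn
        FROM customers c
        INNER JOIN orders o ON c.customer_id = o.customer_id
    ) ranked
    WHERE rn = 1
""").show()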
Q127. Are you familiar with Celebal Technologies
Celebal Technologies is a technology company specializing in data engineering and analytics solutions.
Celebal Technologies is known for providing data engineering and analytics solutions.
They offer services such as data integration, data warehousing, and data visualization.
Celebal Technologies works with clients across various industries to help them optimize their data processes.
They have expertise in technologies like Hadoop, Spark, and Python for data engineering.
Q128. What are the technologies you have worked on?
I have worked on various technologies including Hadoop, Spark, SQL, Python, and AWS.
Experience with Hadoop and Spark for big data processing
Proficient in SQL for data querying and manipulation
Skilled in Python for data analysis and scripting
Familiarity with AWS services such as S3, EC2, and EMR
Knowledge of data warehousing and ETL processes
Q129. How to remove Duplicates in Data frame using pyspark?
Use dropDuplicates() function in pyspark to remove duplicates in a data frame.
Use dropDuplicates() function on the data frame to remove duplicates based on all columns.
Specify subset of columns to remove duplicates based on specific columns.
Use the distinct() function to remove duplicates and keep only distinct rows.
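A quick sketch of all three options (the column names are illustrative):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (1, "a"), (2, "b")], ["id", "val"])

df.dropDuplicates().show()         # drop rows duplicated across all columns
df.dropDuplicates(["id"]).show()   # keep one row per id
df.distinct().show()               # same effect as dropDuplicates() on all columns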
Q130. How do you set up the alerting mechanism in ADF for failed pipelines?
Alerting mechanism in ADF for failed pipelines involves setting up alerts in Azure Monitor and configuring email notifications.
Set up alerts in Azure Monitor for monitoring pipeline runs
Configure alert rules to trigger notifications when a pipeline run fails
Use Azure Logic Apps to send email notifications for failed pipeline runs
Q131. What is difference between Primary and unique key in dbms?
Primary key uniquely identifies a record in a table, while unique key ensures all values in a column are distinct.
Primary key does not allow NULL values, while a unique key allows NULLs (one NULL in SQL Server, multiple in some other databases).
A table can have only one primary key, but multiple unique keys.
Primary key is a combination of unique and not null constraints.
Primary key is used to establish relationships between tables, while unique key is used to enforce uniqueness in a column.
Q132. What operator is used in Composer to move data from GCS to BQ?
The operator used in Composer to move data from GCS to BigQuery is the GCSToBigQueryOperator.
The GCSToBigQueryOperator comes from the Google provider package for Apache Airflow, which is the underlying technology of Composer.
This operator allows you to transfer data from Google Cloud Storage (GCS) to BigQuery.
You can specify the source and destination parameters in the operator to define the data transfer process.
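A minimal Composer (Airflow) task sketch; the bucket, object pattern, and destination table are placeholders:
from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG(dag_id="gcs_to_bq_demo", start_date=datetime(2024, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    load_csv = GCSToBigQueryOperator(
        task_id="load_csv_to_bq",
        bucket="my-landing-bucket",
        source_objects=["incoming/sales_*.csv"],
        destination_project_dataset_table="my_project.my_dataset.sales",
        source_format="CSV",
        skip_leading_rows=1,
        autodetect=True,
        write_disposition="WRITE_APPEND",
    )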
Q133. How would you build a pipeline to connect to an HTTP source and bring data into ADLS?
Build a pipeline that pulls data from an HTTP source and lands it in ADLS.
Set up a data ingestion tool like Apache NiFi or Azure Data Factory to pull data from the http source
Transform the data as needed using tools like Apache Spark or Azure Databricks
Store the data in Azure Data Lake Storage (ADLS) for further processing and analysis
Q134. How have you managed experiences involving strict timelines and deliverables?
I have successfully managed experiences involving strict timelines and deliverables by prioritizing tasks, setting clear goals, and communicating effectively with team members.
Prioritizing tasks based on deadlines and importance
Setting clear goals and milestones to track progress
Communicating effectively with team members to ensure everyone is on the same page
Proactively identifying potential roadblocks and finding solutions to overcome them
Q135. Python vs Java advantages and disadvantages, Kubernetes
Python is more flexible and easier to learn, while Java is more performant and better for large-scale projects. Kubernetes is a popular container orchestration tool.
Python is more concise and easier to read/write than Java
Java is more performant and better for large-scale projects
Kubernetes is a popular container orchestration tool used for managing containerized applications
Kubernetes provides features like automatic scaling, self-healing, and rolling updates
Python is often preferred for data engineering, scripting, and rapid prototyping, while Java suits large, performance-critical enterprise systems.
Q136. How will you design/configure a cluster if you are given 10 petabytes of data?
Designing/configuring a cluster for 10 petabytes of data involves considerations for storage capacity, processing power, network bandwidth, and fault tolerance.
Consider using a distributed file system like HDFS or object storage like Amazon S3 to store and manage the large volume of data.
Implement a scalable processing framework like Apache Spark or Hadoop to efficiently process and analyze the data in parallel.
Utilize a cluster management system like Apache Mesos, Kubernetes, or YARN to allocate resources, and provision extra raw capacity for replication and fault tolerance (e.g., 3x replication triples the storage requirement).
Q137. What is the SQL query to find the third highest salary from a given table?
Use SQL query with ORDER BY and LIMIT to find the third highest salary from a table.
Use ORDER BY clause to sort salaries in descending order
Use LIMIT 1 OFFSET 2 to skip the first two highest salaries
Example: SELECT salary FROM employees ORDER BY salary DESC LIMIT 1 OFFSET 2
Q138. How will you run a child notebook from a parent notebook using the dbutils command?
Use dbutils.notebook.run() command to run a child notebook in a parent notebook
Use dbutils.notebook.run() command with the path to the child notebook and any parameters needed
Ensure that the child notebook is accessible and has necessary permissions
Handle any return values or errors from the child notebook appropriately
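A minimal sketch as run from the parent notebook (dbutils is available implicitly in Databricks; the child path and parameter names are assumptions):
# Run the child, wait up to 10 minutes, and pass parameters as widget values
result = dbutils.notebook.run(
    "/Workspace/etl/child_notebook",
    timeout_seconds=600,
    arguments={"run_date": "2024-01-01", "env": "dev"},
)
print(result)   # whatever the child returned via dbutils.notebook.exit(...)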
Q139. What is the difference between Hadoop and Spark? Difference between coalesce and repartition? SQL query, HDFS
Hadoop is a distributed storage and processing framework, while Spark is a fast and general-purpose cluster computing system.
Hadoop is primarily used for batch processing of large datasets, while Spark is known for its in-memory processing capabilities.
Hadoop uses the disk-based MapReduce execution model, while Spark processes Resilient Distributed Datasets (RDDs) largely in memory.
Coalesce is used to reduce the number of partitions in a DataFrame or RDD without shuffling data, while repartition is used to increase or rebalance partitions and triggers a full shuffle.
Q140. What will be the Spark configuration to process 2 GB of data?
Set spark configuration with appropriate memory and cores for efficient processing of 2 GB data
Increase executor memory and cores to handle larger data size
Adjust spark memory overhead to prevent out of memory errors
Optimize shuffle partitions for better performance
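A hedged example configuration for such a small input; the exact numbers depend on the cluster, and 2 GB rarely needs more than a handful of cores:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("process_2gb")
    .config("spark.executor.memory", "4g")         # headroom over the 2 GB input
    .config("spark.executor.cores", "2")
    .config("spark.executor.instances", "4")
    .config("spark.sql.shuffle.partitions", "32")  # well below the default 200 for small data
    .getOrCreate()
)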
Q141. Assume the DataFrames DF1 (UserID, Name) and DF2 (UserID, PageID, Timestamp, Events). Write code to join the DFs, count the number of events per user, and filter users with 0 events
Join DF's, count events, filter users with 0 events
Use join operation to combine DF1 and DF2 on UserID
Group by UserID and count the number of events
Filter out users with 0 events
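One possible PySpark implementation; the sample rows are illustrative, and a left join keeps users with no activity so they can be filtered on a count of 0:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1, "Asha"), (2, "Ben"), (3, "Chen")], ["UserID", "Name"])
df2 = spark.createDataFrame(
    [(1, 10, "2024-01-01 10:00:00", 5),
     (1, 11, "2024-01-01 11:00:00", 2),
     (2, 12, "2024-01-01 12:00:00", 1)],
    ["UserID", "PageID", "Timestamp", "Events"],
)

event_counts = (
    df1.join(df2, on="UserID", how="left")
       .groupBy("UserID", "Name")
       .agg(F.count("Events").alias("EventCount"))   # count() skips the NULLs left by the join
)

event_counts.filter(F.col("EventCount") == 0).show()   # only Chen, who has no events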
Q142. What is Re-Partition and Coalesce? How are these used?
Re-Partition and Coalesce are methods used to control the number of partitions in a dataset in Apache Spark.
Re-Partition is used to increase or decrease the number of partitions in a dataset by shuffling the data across the cluster.
Coalesce is used to decrease the number of partitions in a dataset without shuffling the data, which can improve performance.
Re-Partition is typically used when there is a need to increase parallelism or balance the data distribution, while Coalesce is typically used to shrink the partition count cheaply, for example before writing output files.
Q143. 2 types of modes for Spark architecture?
The two deploy modes in Spark are client mode and cluster mode.
Client mode: the driver runs on the machine that submits the application, which is convenient for interactive development and debugging.
Cluster mode: the driver runs inside the cluster under a cluster manager such as YARN, Kubernetes, or Spark standalone, which is preferred for production workloads.
Q144. How would you build a pipeline for a Machine learning project?
To build a pipeline for a Machine learning project, you need to collect data, preprocess it, train the model, evaluate its performance, and deploy it.
Collect relevant data from various sources
Preprocess the data by cleaning, transforming, and normalizing it
Split the data into training and testing sets
Train the machine learning model using the training data
Evaluate the model's performance using the testing data
Fine-tune the model if necessary
Deploy the model into a production environment and monitor its performance over time
Q145. What are your experiences with cloud services, particularly with Amazon Web Services (AWS)?
I have extensive experience working with AWS, including setting up and managing various cloud services.
Implemented data pipelines using AWS Glue for ETL processes
Utilized AWS S3 for storing and managing large datasets
Deployed and managed EC2 instances for running data processing tasks
Used AWS Lambda for serverless computing and automation
Worked with AWS Redshift for data warehousing
Q146. What is the difference between UNION and UNION ALL?
Union combines and removes duplicates, Union All combines all rows including duplicates.
Union merges two tables and removes duplicates
Union All merges two tables and includes duplicates
Union is slower than Union All as it removes duplicates
Syntax: SELECT column1, column2 FROM table1 UNION/UNION ALL SELECT column1, column2 FROM table2
Example: SELECT name FROM table1 UNION SELECT name FROM table2
Q147. What are transformations, and how many types of transformations exist?
Transformations are operations performed on data to convert it from one form to another. There are mainly two types of transformations: narrow and wide.
Transformations are operations performed on data to convert it from one form to another.
Narrow transformations are those where each input partition will contribute to only one output partition, e.g., map, filter.
Wide transformations are those where each input partition may contribute to multiple output partitions, e.g., groupByKey, reduceByKey, join.
Q148. What are different type of joins available in Databricks?
Different types of joins available in Databricks include inner join, outer join, left join, right join, and cross join.
Inner join: Returns only the rows that have matching values in both tables.
Outer (full) join: Returns all rows from both tables, with NULLs where there is no match.
Left join: Returns all rows from the left table and the matched rows from the right table.
Right join: Returns all rows from the right table and the matched rows from the left table.
Cross join: Returns the Cartesian product of the two tables, pairing every row of one with every row of the other.
Q149. Write code to print the reverse of a sentence word by word.
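A short Python sketch: split on whitespace, reverse the word order, and re-join:
def reverse_words(sentence: str) -> str:
    return " ".join(sentence.split()[::-1])

print(reverse_words("data engineers build pipelines"))   # pipelines build engineers data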
Q150. What tools and frameworks have you used, and can you share details about them?
I have experience with tools like Apache Spark, Hadoop, and Kafka for data processing and streaming.
Apache Spark for large-scale data processing
Hadoop for distributed storage and processing
Kafka for real-time data streaming