Data Engineer
1000+ Data Engineer Interview Questions and Answers
Q101. How do you remove duplicate rows in BigQuery? How do you find the month of a given date in BigQuery?
To remove duplicate rows in BigQuery, use the DISTINCT keyword (or rewrite the table with a SELECT DISTINCT). To find the month of a given date, use the EXTRACT function.
To remove duplicates from query results, use SELECT DISTINCT * FROM table_name; to deduplicate the table itself, use CREATE OR REPLACE TABLE table_name AS SELECT DISTINCT * FROM table_name;
To find the month of a given date, use SELECT EXTRACT(MONTH FROM date_column) AS month FROM table_name; EXTRACT(MONTH ...) returns the month as a number from 1 to 12.
Make sure to replace 'table_name' and 'date_column' with the appropriate values in your query.
Q102. Write SQL code to get the city1-city2 distance from a table in which city1 and city2 pairs can repeat
SQL approach for getting city1-city2 distances when the same pair of cities can appear more than once, possibly in reversed order
Normalise each pair, e.g. with LEAST(city1, city2) and GREATEST(city1, city2), so (A, B) and (B, A) are treated as the same pair
Group by the normalised pair and return one distance per pair; use a self join if the distance has to be derived by matching city1 and city2 against a coordinates table
Consider a subquery if needed; a minimal sketch follows below
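A minimal sketch of the pair-normalisation idea above, written as a Spark SQL call; the table name city_distance and its columns (city1, city2, distance) are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("city_distance").getOrCreate()

# Assumes a table or temp view city_distance(city1, city2, distance) already exists.
deduped = spark.sql("""
    SELECT LEAST(city1, city2)    AS city_a,    -- normalise pair ordering
           GREATEST(city1, city2) AS city_b,
           MIN(distance)          AS distance   -- one distance per unordered pair
    FROM city_distance
    GROUP BY LEAST(city1, city2), GREATEST(city1, city2)
""")
deduped.show()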
Q103. Difference between logical plan and physical plan in pyspark?
Logical plan represents the high-level abstract representation of the computation to be performed, while physical plan represents the actual execution plan with specific optimizations and details.
Logical plan is a high-level abstract representation of the computation to be performed.
Physical plan is the actual execution plan with specific optimizations and details.
Logical plan is created first and then optimized to generate the physical plan.
Physical plan includes details lik...read more
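A quick way to see both plans for yourself; a small sketch assuming a local SparkSession:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("plans").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "grp"])
agg = df.groupBy("grp").agg(F.count("*").alias("cnt"))

# Prints the parsed, analyzed and optimized logical plans followed by the physical plan
agg.explain(extended=True)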
Q104. What are Python DataFrames, how do you use them in a project, and where?
Python dataframes are used to organize and manipulate data in a tabular format.
Dataframes are created using the pandas library in Python.
They allow for easy manipulation of data, such as filtering, sorting, and grouping.
Dataframes can be used in various projects, such as data analysis, machine learning, and data visualization.
Examples of using dataframes include analyzing sales data, predicting customer behavior, and visualizing stock market trends.
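A small illustration of the filtering, sorting and grouping points above, using made-up sales data:

import pandas as pd

df = pd.DataFrame({"region": ["N", "S", "N", "S"],
                   "sales": [100, 80, 150, 60]})

high = df[df["sales"] > 75]                         # filtering
ranked = df.sort_values("sales", ascending=False)   # sorting
by_region = df.groupby("region")["sales"].sum()     # grouping/aggregation
print(by_region)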
Q105. What is AFD? Build a dynamic pipeline; Spark architecture; SQL data flow
AFD is not a commonly used term in data engineering; in this context it may be a typo for ADF (Azure Data Factory), so it is worth asking the interviewer for more context.
Q106. How do you create calculated columns in pandas?
Calculated columns in pandas can be created using the assign() method or by directly assigning a new column based on existing columns.
Use the assign() method to create calculated columns by specifying the new column name and the calculation formula.
Directly assign a new column by referencing existing columns and applying the desired calculation.
Example: df.assign(new_column = df['column1'] + df['column2'])
Example: df['new_column'] = df['column1'] * 2
Q107. What is a Normal Distribution in a Dataset?
A normal distribution in a dataset is a bell-shaped curve where the data is symmetrically distributed around the mean.
Data is symmetric about the mean, so the mean, median, and mode coincide
68% of data falls within one standard deviation of the mean
95% of data falls within two standard deviations of the mean
Examples: heights of people, test scores in a class
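A quick numeric check of the 68/95 rule on simulated data; a sketch using NumPy, so the exact fractions vary slightly from run to run:

import numpy as np

rng = np.random.default_rng(0)
heights = rng.normal(loc=170, scale=10, size=100_000)   # simulated heights: mean 170, std 10

within_1sd = np.mean(np.abs(heights - 170) <= 10)   # roughly 0.68
within_2sd = np.mean(np.abs(heights - 170) <= 20)   # roughly 0.95
print(within_1sd, within_2sd)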
Q108. Write a Python program for list/dictionary comprehensions
Python program for list/dictionary comprehensions
List comprehension: [expression for item in iterable]
Dictionary comprehension: {key_expression: value_expression for item in iterable}
Example: squares = [x**2 for x in range(10)]
Example: dict_squares = {x: x**2 for x in range(10)}
Q109. Rate yourself out of 5 in Pyspark , Python and SQL
I would rate myself 4 in Pyspark, 5 in Python, and 4 in SQL.
Strong proficiency in Python programming language
Experience in working with Pyspark for big data processing
Proficient in writing complex SQL queries for data manipulation
Familiarity with optimizing queries for performance
Hands-on experience in data engineering projects
Q110. Write an SQL query to get the names of students who got marks > 45 in each subject from the Student table
SQL query to retrieve student names with marks > 45 in each subject
Use GROUP BY and HAVING clauses to filter students with marks > 45 in each subject
Join Student table with Marks table on student_id to get marks for each student
Select the student names from the Student table based on these conditions, as in the sketch below
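One possible form of the query, shown through Spark SQL; the tables Student(student_id, name) and Marks(student_id, subject, marks) are assumed to exist as views, and a student qualifies only if their lowest mark is above 45:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("marks").getOrCreate()

result = spark.sql("""
    SELECT s.name
    FROM Student s
    JOIN Marks m ON m.student_id = s.student_id
    GROUP BY s.student_id, s.name
    HAVING MIN(m.marks) > 45        -- every subject's mark must exceed 45
""")
result.show()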
Q111. Provide an approach to develop an ETL pipeline where a CSV file is dropped in S3, transformed through Airflow, and loaded into Snowflake.
Develop ETL pipeline using Airflow to process CSV files in S3 and load data into Snowflake.
Set up an S3 sensor in Airflow to detect when a new CSV file is dropped in the specified bucket.
Create a custom Python operator in Airflow to read the CSV file from S3, perform necessary transformations, and load data into Snowflake.
Use SnowflakeHook in Airflow to establish connection with Snowflake and execute SQL queries to load data.
Schedule the ETL pipeline in Airflow to run at spec...read more
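A skeleton of such a DAG; this is a sketch only - the sensor/operator classes come from the Amazon and Snowflake provider packages, the bucket, connection IDs and COPY statement are placeholders, and argument names can differ slightly between Airflow versions:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

def transform_csv(**context):
    # Read the CSV from S3 (e.g. with boto3/pandas), clean it, and stage the result for loading.
    ...

with DAG("s3_csv_to_snowflake", start_date=datetime(2024, 1, 1),
         schedule="@daily", catchup=False) as dag:
    wait_for_file = S3KeySensor(task_id="wait_for_file", bucket_name="my-bucket",
                                bucket_key="incoming/*.csv", wildcard_match=True,
                                aws_conn_id="aws_default")
    transform = PythonOperator(task_id="transform", python_callable=transform_csv)
    load = SnowflakeOperator(task_id="load", snowflake_conn_id="snowflake_default",
                             sql="COPY INTO my_table FROM @my_stage FILE_FORMAT = (TYPE = CSV)")
    wait_for_file >> transform >> load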
Q112. Calculate second highest salary using SQL as well as pyspark.
Calculate second highest salary using SQL and pyspark
Use a SQL query with ORDER BY salary DESC and LIMIT 1 OFFSET 1 (or a DENSE_RANK window function) to get the second highest salary
In PySpark, use distinct(), orderBy() and limit()/collect() to achieve the same result, as in the sketch below
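Hedged sketches of both approaches, using a small hypothetical employees dataset with a salary column:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("second_salary").getOrCreate()
emp = spark.createDataFrame([("a", 100), ("b", 300), ("c", 200)], ["name", "salary"])
emp.createOrReplaceTempView("employees")

# SQL: rank distinct salaries and keep rank 2
spark.sql("""
    SELECT salary FROM (
        SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
        FROM employees
    ) t
    WHERE rnk = 2
""").show()

# PySpark DataFrame API: same idea without SQL
second = (emp.select("salary").distinct()
             .orderBy(F.col("salary").desc())
             .limit(2).collect()[-1][0])
print(second)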
Q113. How do you optimize SQL queries?
Optimizing SQL queries involves using indexes, avoiding unnecessary joins, and optimizing the query structure.
Use indexes on columns frequently used in WHERE clauses
Avoid using SELECT * and only retrieve necessary columns
Optimize joins by using INNER JOIN instead of OUTER JOIN when possible
Use EXPLAIN to analyze query performance and make necessary adjustments
Q114. Find the second max salary of employees by department using a Python pandas DataFrame
Find second max salary of employee by department using pandas dataframe
Group the dataframe by department
Sort the salaries in descending order
Select the second highest salary for each department
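A compact pandas sketch of the steps above, assuming a hypothetical DataFrame with department and salary columns and at least two distinct salaries per department:

import pandas as pd

df = pd.DataFrame({"department": ["IT", "IT", "IT", "HR", "HR"],
                   "salary": [90, 120, 110, 70, 85]})

# For each department: drop duplicate salaries, take the two largest, keep the smaller of them
second_max = (df.groupby("department")["salary"]
                .apply(lambda s: s.drop_duplicates().nlargest(2).iloc[-1]))
print(second_max)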
Q115. What is ETL, and can you provide examples from your project to illustrate its application?
ETL stands for Extract, Transform, Load. It is a process used to extract data from various sources, transform it into a consistent format, and load it into a target database.
Extract: Retrieving data from different sources such as databases, APIs, or files.
Transform: Cleaning, filtering, and structuring the extracted data to fit the target database schema.
Load: Loading the transformed data into the target database for analysis or reporting.
Example: Extracting customer informat...read more
Q116. What is the difference between RDD (Resilient Distributed Datasets) and DataFrame in Apache Spark?
RDD is a low-level abstraction representing a distributed collection of objects, while DataFrame is a higher-level abstraction representing a distributed collection of data organized into named columns.
RDD is more suitable for unstructured data and low-level transformations, while DataFrame is more suitable for structured data and high-level abstractions.
DataFrames provide optimizations like query optimization and code generation, while RDDs do not.
DataFrames support SQL quer...read more
Q117. What is the difference between the reduceBy and groupBy transformations in Apache Spark?
reduceByKey aggregates the values for each key, while groupByKey only groups the values for each key.
reduceByKey is a transformation that combines the values of each key using an associative, commutative function.
groupByKey is a transformation that groups the data by key and returns each key with an iterable of its values.
reduceByKey is more efficient for aggregation because it combines values on each partition before shuffling (map-side combine), while groupByKey shuffles all the values before grouping; a small sketch follows below.
reduceBy is typically u...read more
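A small RDD sketch showing both transformations producing the same totals, with reduceByKey doing the per-partition combine first:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reduce_vs_group").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

# reduceByKey: values are pre-aggregated on each partition before the shuffle
print(pairs.reduceByKey(lambda x, y: x + y).collect())   # [('a', 4), ('b', 6)]

# groupByKey: all values are shuffled first, then summed afterwards
print(pairs.groupByKey().mapValues(sum).collect())       # [('a', 4), ('b', 6)]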
Q118. What is the difference between UNION and UNION ALL?
UNION combines and removes duplicates, UNION ALL combines without removing duplicates.
UNION merges the results of two or more SELECT statements into a single result set.
UNION ALL does the same as UNION, but it does not remove duplicates.
UNION is more resource-intensive than UNION ALL because it performs a duplicate-removal step (typically a sort or hash).
UNION ALL is faster than UNION when all records are required.
Example: SELECT column1 FROM table1 UNION SELECT column1 FROM table2;
Example: SELECT column1 FROM...read more
Q119. 3 SQL queries; architecture for a data pipeline to display hours per activity in the last 24 hrs on a smartwatch.
Design a data pipeline to display hourly activity on a smartwatch for the last 24 hours using SQL queries.
Create a table to store activity data with columns for timestamp and activity type.
Use a SQL query to aggregate activity data by hour for the last 24 hours.
Display the results on the smartwatch using a suitable visualization.
Q120. How to deal with data quality issues
Data quality issues can be dealt with by identifying the root cause, implementing data validation checks, and establishing data governance policies.
Identify the root cause of the data quality issue
Implement data validation checks to prevent future issues
Establish data governance policies to ensure data accuracy and consistency
Regularly monitor and audit data quality
Involve stakeholders in the data quality process
Use data profiling and cleansing tools
Ensure data security and p...read more
Q121. ETL - How to do full load in SSIS, mention the steps
To perform a full load in SSIS, you can use the Data Flow Task with a source and destination component.
Create a Data Flow Task in the Control Flow tab of the SSIS package.
Add a source component to extract data from the source system.
Add a destination component to load data into the destination system.
Map the columns from the source to the destination.
Run the package to execute the full load.
Q122. ETL - how to do an incremental load in ADF and in SSIS
Incremental load in ADF and SSIS involves identifying new or updated data and loading only those changes.
In ADF, use watermark columns to track the last loaded value and filter data based on this value
In SSIS, use CDC (Change Data Capture) components or custom scripts to identify new or updated data
Both ADF and SSIS support incremental loading by comparing source and target data to determine changes
Q123. How to copy data without using multiple activities, dynamically using loops/parameterization.
Use a single, parameterized copy step driven by a loop rather than one activity per source (in ADF this is typically a ForEach activity over a parameterized Copy activity).
Use a loop to iterate through the source and destination locations.
Parameterize the source and destination locations so the same step can copy any of them.
Utilize a scripting language like Python or PowerShell to implement the logic.
Example: Use a Python script with a loop to copy files from one folder to another.
Example: Use PowerShell script with dynamic parameters to copy data from one database to another.
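A minimal Python sketch of the loop-plus-parameters idea; the (source, destination) folder pairs are hypothetical and would normally come from configuration:

import shutil
from pathlib import Path

copy_jobs = [("/data/raw/sales", "/data/curated/sales"),
             ("/data/raw/customers", "/data/curated/customers")]

for src, dst in copy_jobs:
    Path(dst).mkdir(parents=True, exist_ok=True)
    for f in Path(src).glob("*.csv"):
        shutil.copy2(f, Path(dst) / f.name)   # copy each CSV, preserving metadata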
Q124. How to insert non-duplicate data into a target table? In how many ways can we do it?
To insert non-duplicate data into a target table, you can use methods like using a unique constraint, using a merge statement, or using a temporary table.
Use a unique constraint on the target table to prevent duplicate entries.
Use a merge statement to insert data into the target table only if it does not already exist.
Use a temporary table to store the new data, then insert only the non-duplicate records into the target table.
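A sketch of the merge-style option in Spark SQL; the target and staging tables and the id key are placeholders, and MERGE requires a table format that supports it (e.g. Delta Lake):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedup_insert").getOrCreate()

# Insert only rows whose key does not already exist in the target table
spark.sql("""
    MERGE INTO target t
    USING staging s
    ON t.id = s.id
    WHEN NOT MATCHED THEN INSERT *
""")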
Q125. Have you worked on Lambda functions? Explain.
AWS Lambda is a serverless compute service that runs code in response to events and automatically manages the computing resources required.
Lambda functions are event-driven and can be triggered by various AWS services such as S3, DynamoDB, API Gateway, etc.
They are written in languages like Python, Node.js, Java, etc.
Lambda functions are scalable and cost-effective as you only pay for the compute time you consume.
They can be used for data processing, real-time file pro...read more
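A minimal handler for an S3-triggered Lambda, as a sketch; the event shape follows the standard S3 "ObjectCreated" notification:

import json

def lambda_handler(event, context):
    # Log every object reported in the S3 event that triggered this invocation
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print(f"New object: s3://{bucket}/{key}")
    return {"statusCode": 200, "body": json.dumps("processed")}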
Q126. Explain databricks dlt, and when will you use batch vs streaming?
Databricks DLT (Delta Live Tables) is a declarative framework for building reliable batch and streaming data pipelines.
DLT pipelines run on top of Delta Lake, the storage layer that brings ACID transactions to Apache Spark and big data workloads.
Batch processing is used when data is collected over a period of time and processed in large chunks, while streaming processing is used for real-time data processing.
Use batch processing for historical data analysis, ETL jobs, and periodic reporting. Use streaming proce...read more
Q127. How to split staged data’s row into separate columns
Use SQL functions like SUBSTRING and CHARINDEX to split staged data's row into separate columns
Use SUBSTRING function to extract specific parts of the row
Use CHARINDEX function to find the position of a specific character in the row
Use CASE statements to create separate columns based on conditions
Q128. What is the difference between repartition and coalesce?
Repartition increases the number of partitions in a DataFrame, while coalesce reduces the number of partitions without shuffling data.
Repartition involves a full shuffle of the data across the cluster, which can be expensive.
Coalesce minimizes data movement by only creating new partitions if necessary.
Repartition is typically used when increasing parallelism or evenly distributing data, while coalesce is used for reducing the number of partitions without a full shuffle.
Exampl...read more
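A short sketch that prints the partition counts before and after each call:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitions").getOrCreate()
df = spark.range(1_000_000)

print(df.rdd.getNumPartitions())                    # default parallelism
print(df.repartition(200).rdd.getNumPartitions())   # 200 - full shuffle across the cluster
print(df.coalesce(4).rdd.getNumPartitions())        # 4 - merges existing partitions, no full shuffle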
Q129. How to connect SQL Server to Databricks
To connect SQL server to Databricks, use JDBC/ODBC drivers and configure the connection settings.
Install the appropriate JDBC/ODBC driver for SQL server
Configure the connection settings in Databricks
Use the JDBC/ODBC driver to establish the connection
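A sketch of the JDBC read as it might look in a Databricks notebook (where spark and display are already available); the host, database, table and credentials are placeholders, and in real code the credentials would come from a secret scope:

jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb"

df = (spark.read.format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", "dbo.sales")
      .option("user", "my_user")        # prefer dbutils.secrets.get(...) over literals
      .option("password", "my_password")
      .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
      .load())
display(df)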
Q130. Difference between a view and an indexed view; the pipeline design process; where to monitor pipeline failures and how to fix them
View vs indexed view, pipeline design process, monitoring pipeline failures and fixing
View is a virtual table based on a SELECT query, while indexed view is a view with a clustered index for faster retrieval
Pipeline design process involves defining data sources, transformations, and destinations
Monitor pipeline failures through the orchestration tool's monitoring UI, logs, and alerts (e.g. Airflow or Azure Data Factory monitoring)
Fix pipeline failures by identifying the root cause, adjusting configurations, or updating dependencies
Q131. What are the types of structured query languages?
Types of structured query languages include SQL, PL/SQL, T-SQL, and others.
SQL (Structured Query Language) - widely used for managing relational databases
PL/SQL (Procedural Language/SQL) - Oracle's proprietary extension for SQL
T-SQL (Transact-SQL) - Microsoft's extension for SQL used in SQL Server
Others - SQL dialects used by systems such as MySQL, PostgreSQL, SQLite, etc.
Q132. What are some SQL queries that utilize joins and window functions?
SQL queries using joins and window functions
Use INNER JOIN to combine rows from two or more tables based on a related column
Use WINDOW functions like ROW_NUMBER() to assign a unique sequential integer to each row within a partition
Example: SELECT column1, column2, ROW_NUMBER() OVER(PARTITION BY column1 ORDER BY column2) AS row_num FROM table_name
Q133. Are you familiar with Celebal Technologies
Celebal Technologies is a technology company specializing in data engineering and analytics solutions.
Celebal Technologies is known for providing data engineering and analytics solutions.
They offer services such as data integration, data warehousing, and data visualization.
Celebal Technologies works with clients across various industries to help them optimize their data processes.
They have expertise in technologies like Hadoop, Spark, and Python for data engineering.
The compa...read more
Q134. What are the technologies you have worked on?
I have worked on various technologies including Hadoop, Spark, SQL, Python, and AWS.
Experience with Hadoop and Spark for big data processing
Proficient in SQL for data querying and manipulation
Skilled in Python for data analysis and scripting
Familiarity with AWS services such as S3, EC2, and EMR
Knowledge of data warehousing and ETL processes
Q135. How to remove duplicates in a DataFrame using PySpark?
Use dropDuplicates() function in pyspark to remove duplicates in a data frame.
Use dropDuplicates() function on the data frame to remove duplicates based on all columns.
Specify subset of columns to remove duplicates based on specific columns.
Use the distinct() function to remove duplicates and keep only distinct rows.
Q136. How do you set up an alerting mechanism in ADF for failed pipelines?
Alerting mechanism in ADF for failed pipelines involves setting up alerts in Azure Monitor and configuring email notifications.
Set up alerts in Azure Monitor for monitoring pipeline runs
Configure alert rules to trigger notifications when a pipeline run fails
Use Azure Logic Apps to send email notifications for failed pipeline runs
Q137. What is difference between Primary and unique key in dbms?
Primary key uniquely identifies a record in a table, while unique key ensures all values in a column are distinct.
Primary key does not allow NULL values, while a unique key allows NULLs (only one NULL in SQL Server; some databases allow multiple).
A table can have only one primary key, but multiple unique keys.
Primary key is a combination of unique and not null constraints.
Primary key is used to establish relationships between tables, while unique key is used to enforce uniqueness in a column.
Q138. What operator is used in Composer to move data from GCS to BQ?
The operator used in Composer to move data from GCS to BigQuery is GCSToBigQueryOperator (see the task sketch below).
The GCS to BigQuery operator is used in Apache Airflow, which is the underlying technology of Composer.
This operator allows you to transfer data from Google Cloud Storage (GCS) to BigQuery.
You can specify the source and destination parameters in the operator to define the data transfer process.
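A sketch of the task definition in a Composer DAG; the operator comes from the Google provider package, and the bucket, object pattern and table name are placeholders:

from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG("gcs_to_bq", start_date=datetime(2024, 1, 1), schedule="@daily", catchup=False) as dag:
    load_to_bq = GCSToBigQueryOperator(
        task_id="load_to_bq",
        bucket="my-bucket",
        source_objects=["exports/data_*.csv"],
        destination_project_dataset_table="my_project.my_dataset.my_table",
        source_format="CSV",
        write_disposition="WRITE_APPEND",
    )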
Q139. How would you build a pipeline to connect to an HTTP source and bring data into ADLS?
Build a pipeline that pulls data from an HTTP source and lands it in ADLS
Set up a data ingestion tool like Apache NiFi or Azure Data Factory to pull data from the http source
Transform the data as needed using tools like Apache Spark or Azure Databricks
Store the data in Azure Data Lake Storage (ADLS) for further processing and analysis
Q140. How have you managed experiences involving strict timelines and deliverables?
I have successfully managed experiences involving strict timelines and deliverables by prioritizing tasks, setting clear goals, and communicating effectively with team members.
Prioritizing tasks based on deadlines and importance
Setting clear goals and milestones to track progress
Communicating effectively with team members to ensure everyone is on the same page
Proactively identifying potential roadblocks and finding solutions to overcome them
Q141. Python vs Java advantages and disadvantages; Kubernetes
Python is more flexible and easier to learn, while Java is more performant and better for large-scale projects. Kubernetes is a popular container orchestration tool.
Python is more concise and easier to read/write than Java
Java is more performant and better for large-scale projects
Kubernetes is a popular container orchestration tool used for managing containerized applications
Kubernetes provides features like automatic scaling, self-healing, and rolling updates
Python is often ...read more
Q142. What is the SQL query to find the third highest salary from a given table?
Use SQL query with ORDER BY and LIMIT to find the third highest salary from a table.
Use ORDER BY clause to sort salaries in descending order
Use LIMIT 1 OFFSET 2 to skip the first two highest salaries
Example: SELECT salary FROM employees ORDER BY salary DESC LIMIT 1 OFFSET 2
Q143. How will you design/configure a cluster if you are given 10 petabytes of data?
Designing/configuring a cluster for 10 petabytes of data involves considerations for storage capacity, processing power, network bandwidth, and fault tolerance.
Consider using a distributed file system like HDFS or object storage like Amazon S3 to store and manage the large volume of data.
Implement a scalable processing framework like Apache Spark or Hadoop to efficiently process and analyze the data in parallel.
Utilize a cluster management system like Apache Mesos or Kubernet...read more
Q144. How will you run a child notebook from a parent notebook using a dbutils command?
Use dbutils.notebook.run() command to run a child notebook in a parent notebook
Use dbutils.notebook.run() command with the path to the child notebook and any parameters needed
Ensure that the child notebook is accessible and has necessary permissions
Handle any return values or errors from the child notebook appropriately
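A minimal example as it would appear in the parent Databricks notebook; the path, timeout and parameter are placeholders:

# Run the child notebook, wait up to 600 seconds, and pass parameters as a dict
result = dbutils.notebook.run("/Repos/project/child_notebook", 600, {"run_date": "2024-01-01"})
print(result)   # whatever the child returned via dbutils.notebook.exit(...)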
Q145. What will be the Spark configuration to process 2 GB of data?
Set spark configuration with appropriate memory and cores for efficient processing of 2 GB data
Increase executor memory and cores to handle larger data size
Adjust spark memory overhead to prevent out of memory errors
Optimize shuffle partitions for better performance
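A hedged starting point for such a configuration; 2 GB is small, so modest executors are enough and the exact numbers depend on the cluster:

from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("small_job")
         .config("spark.executor.memory", "2g")
         .config("spark.executor.cores", "2")
         .config("spark.executor.instances", "2")
         .config("spark.sql.shuffle.partitions", "16")   # the default of 200 is overkill for 2 GB
         .getOrCreate())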
Q146. What is the difference between Hadoop and Spark? Difference between coalesce and repartition? SQL query; HDFS
Hadoop is a distributed storage and processing framework, while Spark is a fast and general-purpose cluster computing system.
Hadoop is primarily used for batch processing of large datasets, while Spark is known for its in-memory processing capabilities.
Hadoop uses MapReduce for processing data, while Spark uses Resilient Distributed Datasets (RDDs).
Coalesce is used to reduce the number of partitions in a DataFrame or RDD without shuffling data, while repartition is used to in...read more
Q147. Assume below Dataframes DF1 (UserID,Name) DF2 (UserID,PageID,Timestamp,Events) Write code to Join the DF's, Count the No of Events and filter Users with 0 Events
Join DF's, count events, filter users with 0 events
Use join operation to combine DF1 and DF2 on UserID
Group by UserID and count the number of events
Filter out users with 0 events
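A sketch of the code for the steps above; the left join plus a null-ignoring count is what lets users with no events come out with a count of 0:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("user_events").getOrCreate()

df1 = spark.createDataFrame([(1, "Ana"), (2, "Ben"), (3, "Caz")], ["UserID", "Name"])
df2 = spark.createDataFrame([(1, 10, "t1", "click"), (1, 11, "t2", "view")],
                            ["UserID", "PageID", "Timestamp", "Events"])

counts = (df1.join(df2, on="UserID", how="left")
             .groupBy("UserID", "Name")
             .agg(F.count("Events").alias("event_count")))   # count() ignores nulls

no_events = counts.filter(F.col("event_count") == 0)   # users with 0 events
no_events.show()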
Q148. 2 types of modes for Spark architecture?
The two types of modes for Spark architecture are standalone mode and cluster mode.
Standalone (local) mode: Spark runs on a single machine with a single JVM and is suitable for development and testing.
Cluster mode: Spark runs on a cluster of machines managed by a cluster manager like YARN or Mesos for production workloads.
Q149. What is Re-Partition and Coalesce? How are these used?
Re-Partition and Coalesce are methods used to control the number of partitions in a dataset in Apache Spark.
Re-Partition is used to increase or decrease the number of partitions in a dataset by shuffling the data across the cluster.
Coalesce is used to decrease the number of partitions in a dataset without shuffling the data, which can improve performance.
Re-Partition is typically used when there is a need to increase parallelism or balance the data distribution, while Coalesc...read more
Q150. How would you build a pipeline for a Machine learning project?
To build a pipeline for a Machine learning project, you need to collect data, preprocess it, train the model, evaluate its performance, and deploy it.
Collect relevant data from various sources
Preprocess the data by cleaning, transforming, and normalizing it
Split the data into training and testing sets
Train the machine learning model using the training data
Evaluate the model's performance using the testing data
Fine-tune the model if necessary
Deploy the model into production en...read more