Data Engineer
1000+ Data Engineer Interview Questions and Answers

Asked in Publicis Sapient

Q. What volume of data have you handled in your Proof of Concepts?
I have handled terabytes of data in my POCs, including data from various sources and formats.
Handled terabytes of data in POCs
Worked with data from various sources and formats
Used tools like Hadoop, Spark, and SQL for data processing

Asked in Meditab Software

Q. Given a Python list containing mixed positive and negative numbers, how would you rearrange the list such that positive and negative numbers alternate?
Separate the list into positives and negatives using list comprehensions.
Interleave the two lists element by element (e.g., with zip), appending any leftover values at the end.
Example: for input [-1, 2, -3, 4], positives = [2, 4], negatives = [-1, -3], giving [2, -1, 4, -3].
Consider edge cases like empty lists or lists with all positives or all negatives.
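The steps above can be sketched as follows; the function name `alternate_signs` and the choice to append leftover values when counts are unequal are my own assumptions, since the question leaves that case unspecified.

```python
def alternate_signs(nums):
    """Interleave positives and negatives, appending any leftovers."""
    positives = [n for n in nums if n >= 0]
    negatives = [n for n in nums if n < 0]
    result = []
    for p, n in zip(positives, negatives):
        result.extend([p, n])
    # Append whatever remains when the counts are unequal.
    longer = positives if len(positives) > len(negatives) else negatives
    result.extend(longer[min(len(positives), len(negatives)):])
    return result

print(alternate_signs([-1, 2, -3, 4]))  # [2, -1, 4, -3]
print(alternate_signs([1, 2, 3]))       # [1, 2, 3]
print(alternate_signs([]))              # []
```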

Asked in e2open

Q. How do you delete duplicate rows from a table?
To delete duplicate rows, identify each duplicate group (e.g., with GROUP BY or ROW_NUMBER) and remove all but one row per group.
Note that SELECT DISTINCT only returns unique rows; by itself it does not delete anything from the table.
Use GROUP BY on the duplicated columns to pick each group's surviving row (e.g., MIN(id)).
Use the DELETE statement with a subquery to delete the duplicate rows.
Create a new table with the unique rows and drop the old table.
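The DELETE-with-subquery approach can be demonstrated with SQLite's built-in rowid as the tie-breaker; the employees table and its data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("Ann", "HR"), ("Ann", "HR"), ("Bob", "IT")])

# Keep the lowest rowid in each duplicate group, delete the rest.
conn.execute("""
    DELETE FROM employees
    WHERE rowid NOT IN (
        SELECT MIN(rowid) FROM employees GROUP BY name, dept
    )
""")
rows = conn.execute("SELECT name, dept FROM employees ORDER BY name").fetchall()
print(rows)  # [('Ann', 'HR'), ('Bob', 'IT')]
```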

Asked in Rakuten

Q. How do you design data science solutions?
Designing data science solutions involves understanding requirements, data sources, modeling, and deployment strategies.
Identify the problem: Clearly define the business problem to be solved, e.g., predicting patient readmission rates.
Gather requirements: Collaborate with stakeholders to understand their needs and expectations.
Data collection: Identify and source relevant data, such as electronic health records or sensor data.
Data preprocessing: Clean and transform data to ensure it is ready for modeling.

Asked in Fractal Analytics

Q. Describe a time you were asked to write syntax-level code for Jinja statements in dbt.
Jinja in dbt allows for dynamic SQL generation using templating syntax.
Use `{{ }}` for expressions, e.g., `{{ ref('my_model') }}` to reference another model.
Use `{% %}` for control flow, e.g., `{% if condition %} ... {% endif %}` for conditional logic.
Loop through lists with `{% for item in list %} ... {% endfor %}`.
Define variables with `{% set var_name = value %}` and use them with `{{ var_name }}`.

Asked in Accenture

Q. What is the difference between an interactive cluster and a job cluster?
Interactive clusters allow for real-time interaction and exploration, while job clusters are used for running batch jobs.
Interactive clusters are used for real-time data exploration and analysis.
Job clusters are used for running batch jobs and processing large amounts of data.
Interactive clusters are typically smaller in size and have shorter lifespans.
Job clusters are usually larger and more powerful to handle heavy workloads.
Example: an interactive cluster backs ad hoc notebook analysis, while a job cluster runs scheduled ETL workloads.
Asked in TransOrg Analytics

Q. What is the difference between transformations and actions in PySpark? Give examples.
Transformations in PySpark are lazily evaluated, while actions trigger execution of the accumulated transformations.
Transformations are operations that are not executed immediately but create a plan for execution.
Actions are operations that trigger the execution of transformations and return results.
Examples of transformations include map, filter, and reduceByKey.
Examples of actions include collect, count, and saveAsTextFile.
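As an analogy only (plain Python, not Spark): generators show the same lazy/eager split, with list() playing the role of an action.

```python
# Analogy: building the generator is like a transformation (no work yet);
# consuming it with list() is like an action (work happens now).
log = []

def transform(nums):
    for n in nums:
        log.append(n)        # side effect records when work actually happens
        yield n * 2

lazy = transform([1, 2, 3])  # "transformation": nothing computed yet
print(log)                   # []
result = list(lazy)          # "action": triggers the computation
print(result)                # [2, 4, 6]
print(log)                   # [1, 2, 3]
```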

Asked in Clippd

Q. Why and when should you use generators and decorators in Python?
Generators are used to create iterators, while decorators are used to modify functions or methods.
Generators are used to generate a sequence of values lazily, saving memory and improving performance.
Decorators are used to add functionality to existing functions or methods without modifying their code.
Generators are useful when dealing with large datasets or infinite sequences.
Decorators can be used for logging, caching, authentication, and more.
Example of a generator: a function that uses yield to produce values one at a time.
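A minimal sketch of both ideas together; the decorator name `timed` and the generator `countdown` are my own illustrative choices.

```python
import functools
import time

def timed(func):
    """Decorator: record how long func takes without changing its code."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        wrapper.last_elapsed = time.perf_counter() - start
        return result
    return wrapper

def countdown(n):
    """Generator: yields values lazily instead of building a full list."""
    while n > 0:
        yield n
        n -= 1

@timed
def total(n):
    return sum(countdown(n))  # consumes the generator one value at a time

print(total(5))            # 15
print(list(countdown(3)))  # [3, 2, 1]
```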

Asked in PwC

Q. What is the difference between repartition and coalesce? What is the difference between persist and cache?
repartition vs coalesce, persist vs cache
repartition is used to increase or decrease the number of partitions in a DataFrame, while coalesce is used to decrease the number of partitions without shuffling
persist lets you choose a storage level (memory, disk, or both) for faster reuse, while cache is shorthand for persist with the default storage level
repartition example: df.repartition(10)
coalesce example: df.coalesce(5)
persist example: df.persist()
cache example: df.cache()

Asked in HSBC Group

Q. Write an SQL query to get the second highest salary from each department.
SQL query to retrieve the second highest salary from each department
Use the RANK() function to assign a rank to each salary within each department
Filter the results to only include rows with a rank of 2
Group the results by department to get the second highest salary for each department
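The RANK-and-filter approach above can be run end to end using SQLite (3.25+ for window functions) as a stand-in for the interview database; the emp table and data are hypothetical, and DENSE_RANK plus DISTINCT handle salary ties.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, dept TEXT, salary INT)")
conn.executemany("INSERT INTO emp VALUES (?, ?, ?)", [
    ("a", "Sales", 100), ("b", "Sales", 90), ("c", "Sales", 90),
    ("d", "IT", 200), ("e", "IT", 150),
])
# Rank salaries within each department, then keep rank 2 only.
rows = conn.execute("""
    SELECT DISTINCT dept, salary FROM (
        SELECT dept, salary,
               DENSE_RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS rnk
        FROM emp
    ) WHERE rnk = 2
""").fetchall()
print(sorted(rows))  # [('IT', 150), ('Sales', 90)]
```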

Asked in Amazon

Q. Which techniques would you use to optimize a system you have worked on in the past?
I have used indexing, query optimization, and data partitioning to optimize systems.
Implement indexing on frequently queried columns to improve search performance.
Optimize queries by using proper joins, filters, and aggregations.
Partition large tables to distribute data across multiple storage devices for faster access.
Use materialized views to precompute and store aggregated data for quicker retrieval.

Asked in Infosys

Q. Describe Python dataframes, how they are used in projects, and when they are used.
Python dataframes are used to organize and manipulate data in a tabular format.
Dataframes are created using the pandas library in Python.
They allow for easy manipulation of data, such as filtering, sorting, and grouping.
Dataframes can be used in various projects, such as data analysis, machine learning, and data visualization.
Examples of using dataframes include analyzing sales data, predicting customer behavior, and visualizing stock market trends.

Asked in Procore

Q. Given a dataset containing product, date, and amount, calculate the revenue for 15 days and 30 days.
Calculate revenue over 15 and 30 days using SQL aggregation functions.
Use SUM() function to aggregate the 'amount' column.
Filter data using WHERE clause to limit the date range.
Example for 15 days: SELECT SUM(amount) FROM sales WHERE date >= CURRENT_DATE - INTERVAL '15 days';
Example for 30 days: SELECT SUM(amount) FROM sales WHERE date >= CURRENT_DATE - INTERVAL '30 days';
Ensure date format is consistent for accurate calculations.
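A runnable sketch of the SUM-plus-date-filter pattern using SQLite as a stand-in; the sales table, its data, and the fixed "today" anchor are hypothetical (fixed so the output is reproducible).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, date TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("a", "2024-01-30", 10), ("b", "2024-01-20", 20), ("c", "2024-01-05", 30),
])
today = "2024-01-31"  # anchor "today" so the example is deterministic
totals = {}
for days in (15, 30):
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM sales WHERE date >= date(?, ?)",
        (today, f"-{days} days"),
    ).fetchone()
    totals[days] = total
print(totals)  # {15: 30.0, 30: 60.0}
```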

Asked in Standard Chartered

Q. What is the probability that you can cut a rope into exactly two halves?
If the cut point is chosen uniformly at random along the rope, the probability of hitting the exact midpoint is zero.
A continuous random variable assigns probability zero to any single point, and "exactly two halves" requires one exact point.
In practice, blade width and measurement error add further imprecision, reinforcing the conclusion.
Therefore, the probability of cutting a rope into exactly two halves is zero.

Asked in TransOrg Analytics

Q. What is a Common Table Expression (CTE)? How is a CTE different from a Stored Procedure?
CTE is a temporary result set that can be referenced within a SELECT, INSERT, UPDATE, or DELETE statement. It is different from a Stored Procedure as it is only available for the duration of the query.
CTE stands for Common Table Expression and is defined using the WITH keyword.
CTEs are mainly used for recursive queries, complex joins, and simplifying complex queries.
CTEs are not stored in the database like Stored Procedures; they exist only for the duration of the query's execution.
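A minimal runnable example using SQLite; the orders table and the big_orders CTE name are hypothetical. The CTE exists only for the duration of the single query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INT, amount INT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 5), (2, 15), (3, 25)])
# The CTE big_orders is defined with WITH and vanishes after this query.
row = conn.execute("""
    WITH big_orders AS (
        SELECT id, amount FROM orders WHERE amount > 10
    )
    SELECT COUNT(*), SUM(amount) FROM big_orders
""").fetchone()
print(row)  # (2, 40)
```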

Asked in Grab Greco

Q. Write an SQL query to find the shortest flight duration from New York to Tokyo (HND).
Use SQL query to find shortest flight duration from NY to HND
Use SQL query with MIN function to find shortest duration
Filter flights from NY to HND using WHERE clause
Calculate duration by subtracting arrival time from departure time

Asked in Publicis Sapient

Q. Write SQL code to get the distance between city1 and city2 from a table, considering that city1 and city2 values can repeat.
SQL to get the distance between city1 and city2 when city pairs repeat in the table
Use a self join on the table to match city1 and city2
Calculate the distance between the cities using appropriate formula
Consider using a subquery if needed

Asked in TransOrg Analytics

Q. How would you find the second highest transacting member in each city?
Use SQL query with window function to rank members by transaction amount in each city.
Use SQL query with PARTITION BY clause to group members by city
Use ORDER BY clause to rank members by transaction amount
Select the second highest member for each city

Asked in KPMG India

Q. RDDs vs DataFrames: Which is better and why?
DataFrames are better than RDDs due to their optimized performance and ease of use.
DataFrames are optimized for better performance than RDDs.
DataFrames have a schema, making it easier to work with structured data.
DataFrames support SQL queries and can be used with Spark SQL.
RDDs are more low-level and require more manual optimization.
RDDs are useful for unstructured data or when fine-grained control is needed.

Asked in Infosys

Q. What is the SQL query to find the differences between the current day's sales and the previous day's sales?
SQL query to compare today's sales with yesterday's sales using aggregation and date functions.
Use a table with sales data that includes a date column.
Aggregate sales by date using SUM() function.
Use a Common Table Expression (CTE) or subquery to get sales for today and yesterday.
Calculate the difference between today's and yesterday's sales.
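The CTE-plus-difference approach above can be shown with SQLite's LAG window function (3.25+); the sales table and data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [
    ("2024-01-01", 100), ("2024-01-01", 50),
    ("2024-01-02", 120), ("2024-01-03", 90),
])
# Aggregate per day in a CTE, then subtract the previous day's total via LAG.
rows = conn.execute("""
    WITH daily AS (
        SELECT sale_date, SUM(amount) AS total
        FROM sales GROUP BY sale_date
    )
    SELECT sale_date,
           total - LAG(total) OVER (ORDER BY sale_date) AS diff_vs_prev_day
    FROM daily
""").fetchall()
print(rows)  # first day has no previous day, so its diff is None
```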

Asked in Koantek

Q. Write a Python program to sort elements of a string array by alphabet weightage and resolve all 23 test cases.
Sort string array elements by alphabet weightage in Python and pass 23 test cases.
Use the sorted() function with key parameter to sort elements by weightage
Define a function to calculate weightage of each character
Test the function with various test cases to ensure accuracy
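One possible reading, assuming "weightage" means the sum of letter positions (a=1 ... z=26); the question does not define the scheme, so this weighting is an assumption.

```python
def weight(word):
    """Assumed scheme: a=1 ... z=26, summed over the word's letters."""
    return sum(ord(ch) - ord('a') + 1 for ch in word.lower() if ch.isalpha())

def sort_by_weight(words):
    # sorted() with a key function orders elements by computed weightage
    return sorted(words, key=weight)

print(weight("ba"))                       # 2 + 1 = 3
print(sort_by_weight(["cc", "ba", "a"]))  # ['a', 'ba', 'cc']
```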

Asked in Cymetrix Software

Q. About ETL - What do you know about it and what are fundamental factors to be considered while working on any ETL tool.
ETL stands for Extract, Transform, Load. It is a process of extracting data from various sources, transforming it, and loading it into a target system.
ETL is used to integrate data from different sources into a unified format.
The fundamental factors to consider while working on any ETL tool include data extraction, data transformation, and data loading.
Data extraction involves retrieving data from various sources such as databases, files, APIs, etc.
Data transformation involves converting the extracted data into the required format through cleaning, standardization, and enrichment.

Asked in Celebal Technologies

Q. Are you familiar with Celebal Technologies?
Celebal Technologies is a technology company specializing in data engineering and analytics solutions.
Celebal Technologies is known for providing data engineering and analytics solutions.
They offer services such as data integration, data warehousing, and data visualization.
Celebal Technologies works with clients across various industries to help them optimize their data processes.
They have expertise in technologies like Hadoop, Spark, and Python for data engineering.

Asked in DXC Technology

Q. How can you design an Azure Data Factory pipeline to copy data from a folder containing files with different delimiters to another folder?
Design an Azure Data Factory pipeline to copy data with different delimiters.
Use a Copy Data activity in Azure Data Factory to copy data from source folder to destination folder.
Parameterize the source dataset's column delimiter (or iterate with a ForEach over file types) so one pipeline handles the different delimiters.
Use a mapping data flow to transform the data if needed before copying to the destination folder.

Asked in Accenture

Q. How do you handle duplicates in Python?
Use Python's built-in data structures like sets or dictionaries to handle duplicates.
Use a set to remove duplicates from a list: unique_list = list(set(original_list))
Use a dictionary to remove duplicates from a list while preserving order: unique_list = list(dict.fromkeys(original_list))
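The two one-liners above, runnable end to end; note that only the dict-based version preserves the original order.

```python
original_list = [3, 1, 3, 2, 1]

# Sets drop duplicates but do not guarantee order.
unique_any_order = list(set(original_list))

# Dict keys are unique and keep insertion order (Python 3.7+).
unique_in_order = list(dict.fromkeys(original_list))

print(unique_in_order)           # [3, 1, 2]
print(sorted(unique_any_order))  # [1, 2, 3]
```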

Asked in Walmart

Q. If you need to process a 10TB file in Spark using static allocation, what configuration (executors and cores) would you choose and why?
For processing 10TB of file in Spark, consider allocating multiple executors with sufficient cores to maximize parallel processing.
Allocate multiple executors to handle the large file size efficiently
Determine the optimal number of cores per executor based on the available resources and workload
Consider the memory requirements for each executor to avoid out-of-memory errors
Adjust the configuration based on the specific requirements of the job and cluster setup
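A back-of-envelope sizing sketch for the reasoning above; the 128 MB partition size and the 50-executor, 5-core layout are illustrative assumptions, not a recommendation.

```python
# Assumed: 128 MB per partition (a common Spark default-ish target),
# and a hypothetical 50 executors x 5 cores static allocation.
file_size_gb = 10 * 1024              # 10 TB expressed in GB
partition_mb = 128
partitions = file_size_gb * 1024 // partition_mb
print(partitions)                     # 81920 tasks to schedule

executors, cores_per_executor = 50, 5
parallel_tasks = executors * cores_per_executor
print(parallel_tasks)                 # 250 tasks run at once
print(partitions // parallel_tasks)   # roughly 327 waves of tasks
```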

Asked in Tech Mahindra

Q. How do you remove duplicate rows in BigQuery? How do you find the month of a given date in BigQuery?
To deduplicate rows in BigQuery, select distinct rows and rewrite the table; to find the month of a date, use the EXTRACT function.
To deduplicate, use CREATE OR REPLACE TABLE table_name AS SELECT DISTINCT * FROM table_name; (SELECT DISTINCT alone only returns deduplicated results, it does not modify the table).
To find the month of a given date, use SELECT EXTRACT(MONTH FROM date_column) AS month FROM table_name;
Make sure to replace 'table_name' and 'date_column' with the appropriate values in your query.

Asked in AlphaSense

Q. Tell me about a data engineering challenge you faced. How did you tackle it and what was the outcome?
Migrating data from on-premise servers to cloud storage
Identified data sources and destination in cloud storage
Developed ETL pipelines to extract, transform, and load data
Ensured data integrity and security during migration process
Monitored and optimized performance of data transfer
Collaborated with cross-functional teams for successful migration

Asked in Fragma Data Systems

Q. There are four cores and four worker nodes in Spark. How many jobs will run in parallel?
With four cores in total across four worker nodes, up to four tasks run in parallel; under Spark's default FIFO scheduler, one job's stages execute at a time.
Each core runs one task at a time, so four cores give at most four concurrent tasks.
Parallelism is bounded by total cores, not by the number of jobs submitted.
Multiple jobs can run concurrently only if they are submitted from separate threads (optionally with the FAIR scheduler enabled).

Asked in HSBC Group

Q. What are the window functions you have used?
Window functions are used to perform calculations across a set of rows that are related to the current row.
Commonly used window functions include ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, FIRST_VALUE, LAST_VALUE, and NTILE.
Window functions are used in conjunction with the OVER clause to define the window or set of rows to perform the calculation on.
Window functions can be used to calculate running totals, moving averages, and other aggregate calculations.
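Several of the functions listed above can be demonstrated in one query using SQLite (3.25+ for window functions); the table and data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (day INT, amount INT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, 10), (2, 20), (3, 30)])
# ROW_NUMBER numbers rows, SUM OVER gives a running total,
# LAG pulls the previous row's value; all use the same OVER clause.
rows = conn.execute("""
    SELECT day,
           ROW_NUMBER() OVER (ORDER BY day) AS rn,
           SUM(amount)  OVER (ORDER BY day) AS running_total,
           LAG(amount)  OVER (ORDER BY day) AS prev_amount
    FROM t
""").fetchall()
print(rows)  # [(1, 1, 10, None), (2, 2, 30, 10), (3, 3, 60, 20)]
```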