Data Engineer
1000+ Data Engineer Interview Questions and Answers

Asked in IBM

Q. What is the difference between partitioning and bucketing?
Partitioning is dividing data into smaller chunks based on a column value. Bucketing is dividing data into equal-sized buckets based on a hash function.
Partitioning is used for organizing data for efficient querying and processing.
Bucketing is used for evenly distributing data across nodes in a cluster.
Partitioning is done based on a column value, such as date or region.
Bucketing is done based on a hash of a column's value, which determines the bucket each row is assigned to.
Partitioning can improve query performance...read more
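As an illustration (not part of the original answer), here is a minimal PySpark sketch showing both techniques when writing tables; the input path, table names, and columns are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-vs-bucket").getOrCreate()
df = spark.read.parquet("/data/sales")  # hypothetical input path

# Partitioning: one directory per distinct value of the partition column
df.write.partitionBy("region").mode("overwrite").saveAsTable("sales_partitioned")

# Bucketing: rows are hashed on customer_id into a fixed number of buckets
(df.write.bucketBy(8, "customer_id")
   .sortBy("customer_id")
   .mode("overwrite")
   .saveAsTable("sales_bucketed"))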

Asked in Fragma Data Systems

Q. What optimization techniques have you used in your project?
I have used techniques like indexing, query optimization, and parallel processing in my projects.
Indexing: Used to improve the speed of data retrieval by creating indexes on columns frequently used in queries.
Query optimization: Rewriting queries to improve efficiency and reduce execution time.
Parallel processing: Distributing tasks across multiple processors to speed up data processing.
Caching: Storing frequently accessed data in memory to reduce the need for repeated retrie...read more
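A small sketch of the caching point above, assuming PySpark; the path and column names are made up for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("caching-example").getOrCreate()
events = spark.read.parquet("/data/events")  # hypothetical source

# Cache a frequently reused intermediate result so repeated actions
# do not recompute it from the source files
daily = events.groupBy("event_date").count().cache()
daily.count()   # first action materializes the cache
daily.show(10)  # later actions reuse the cached data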

Asked in Infovision

Q. What is IR (Integration Runtime)? What are the types of IR?
Integration Runtime (IR) is a compute infrastructure that provides data integration capabilities across different network environments.
IR is used in Azure Data Factory to provide data integration capabilities
There are three types of IR: Azure, Self-hosted, and Azure-SSIS
Azure IR is fully managed by Microsoft and is used for data movement in the cloud
Self-hosted IR allows data movement between on-premises and cloud data stores
Azure-SSIS IR is used for running SQL Server Integr...read more

Asked in IntelliSurge Technologies

Q. How do you find duplicate values in SQL? What is the use of a CTE, and what are indexes, views, and triggers in SQL?
Finding duplicates in SQL involves using GROUP BY and CTEs, while indexes, views, and triggers enhance database performance and functionality.
Use GROUP BY to find duplicates: SELECT column_name, COUNT(*) FROM table_name GROUP BY column_name HAVING COUNT(*) > 1.
Common Table Expressions (CTEs) simplify complex queries and improve readability: WITH cte AS (SELECT ...), SELECT * FROM cte.
Indexes speed up data retrieval: CREATE INDEX index_name ON table_name(column_name).
Views are...read more

Asked in EPAM Systems

Q. Write code to print the reverse of a sentence word by word.
This code reverses the order of words in a given sentence, providing a clear output of the reversed sentence.
Split the Sentence: Use a method to split the sentence into an array of words. Example: 'Hello World' becomes ['Hello', 'World'].
Reverse the Array: Utilize an array method to reverse the order of the words. Example: ['Hello', 'World'] becomes ['World', 'Hello'].
Join the Words: Combine the reversed array back into a string. Example: ['World', 'Hello'] becomes 'World Hel...read more
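A minimal Python version of the steps above (the sample sentence is illustrative):

def reverse_words(sentence: str) -> str:
    # Split the sentence into words, reverse the list, and join it back
    words = sentence.split()
    return " ".join(reversed(words))

print(reverse_words("Hello World"))  # prints: World Hello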

Asked in HSBC Group

Q. How do you handle null values in PySpark?
Null values in PySpark are handled using functions such as dropna(), fillna(), and replace().
dropna() function is used to drop rows or columns with null values
fillna() function is used to fill null values with a specified value or method
replace() function is used to substitute specific values; for null values themselves, fillna() is typically used
coalesce() function is used to replace null values with the first non-null value in a list of columns
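A short sketch of the functions mentioned above, with made-up column names and sample data:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("null-handling").getOrCreate()
df = spark.createDataFrame([(2, "a"), (1, None), (None, "b")], ["id", "label"])

df.dropna(subset=["id"]).show()         # drop rows where id is null
df.fillna({"label": "unknown"}).show()  # fill nulls in label with a default
df.withColumn("label", F.coalesce("label", F.lit("n/a"))).show()  # first non-null value wins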

Asked in Linedata

Q. How would you monitor an overnight data load job in Snowflake?
Monitor an overnight data load job in Snowflake using alerts, logs, query history, and automated retries.
Set up alerts and notifications for job completion or failure
Check job logs for any errors or issues
Monitor resource usage during the data load process
Use Snowflake's query history to track job progress
Implement automated retries in case of failures
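As a hedged illustration only (connection details and the table name are hypothetical), a Python check against Snowflake's COPY_HISTORY table function could look like this:

import snowflake.connector

# Hypothetical credentials; in practice these come from a secrets manager
conn = snowflake.connector.connect(account="my_account", user="etl_monitor",
                                   password="***", warehouse="LOAD_WH",
                                   database="ANALYTICS", schema="PUBLIC")
cur = conn.cursor()

# COPY_HISTORY reports rows loaded and errors for recent COPY INTO runs
cur.execute("""
    SELECT file_name, status, row_count, first_error_message
    FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
        TABLE_NAME => 'SALES_FACT',
        START_TIME => DATEADD(hour, -12, CURRENT_TIMESTAMP())))
""")
for file_name, status, rows, err in cur.fetchall():
    print(file_name, status, rows, err)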

Asked in DELTA

Q. How can you achieve AWS cross-account sharing?
AWS cross-account sharing can be achieved by using IAM roles and policies.
Create an IAM role in the account that will share resources
Define a trust policy in the role to allow the other account to assume the role
Attach a policy to the role granting the necessary permissions
In the receiving account, create an IAM role with a trust policy allowing the sharing account to assume the role
Use the AWS CLI or console to assume the role and access the shared resources
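A hedged boto3 sketch of the last step; the role ARN, session name, and bucket are placeholders:

import boto3

sts = boto3.client("sts")
# Assume the role that the sharing account exposed for cross-account access
resp = sts.assume_role(RoleArn="arn:aws:iam::123456789012:role/SharedDataRole",
                       RoleSessionName="cross-account-read")
creds = resp["Credentials"]

# Use the temporary credentials to access the shared resources (S3 as an example)
s3 = boto3.client("s3",
                  aws_access_key_id=creds["AccessKeyId"],
                  aws_secret_access_key=creds["SecretAccessKey"],
                  aws_session_token=creds["SessionToken"])
print(s3.list_objects_v2(Bucket="shared-data-bucket").get("KeyCount"))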

Asked in Capgemini

Q. How do you remove duplicates in a DataFrame using PySpark?
Use the dropDuplicates() function in PySpark to remove duplicates from a DataFrame.
Call dropDuplicates() on the DataFrame to remove duplicates based on all columns.
Specify a subset of columns to remove duplicates based on specific columns.
Use the distinct() function to remove duplicates and keep only distinct rows.
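A brief PySpark sketch of the three options above; the columns and data are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedup").getOrCreate()
df = spark.createDataFrame([(1, "a"), (1, "a"), (2, "b")], ["id", "val"])

df.dropDuplicates().show()        # deduplicate on all columns
df.dropDuplicates(["id"]).show()  # deduplicate on a subset of columns
df.distinct().show()              # keep only distinct rows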

Asked in HCLTech

Q. How do you connect to Azure Data Lake Storage Gen2 from Databricks?
To connect to ADLS Gen2 from Databricks, you can use the Azure Blob Storage API.
Use the Azure Blob Storage API to connect to ADLS Gen2 from Databricks
Provide the storage account name and key for authentication
Use the storage account name as the filesystem
Example: spark.conf.set('fs.azure.account.key.<storage-account-name>.blob.core.windows.net', '<storage-account-key>')
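For reference, the more common pattern for ADLS Gen2 specifically uses the abfss scheme and the dfs endpoint. A hedged sketch, assuming a Databricks notebook where spark is predefined and with placeholder account, container, and key values:

# Account-key auth for ADLS Gen2; keys would normally come from a secret scope
spark.conf.set("fs.azure.account.key.<storage-account>.dfs.core.windows.net",
               "<storage-account-key>")

df = spark.read.parquet(
    "abfss://<container>@<storage-account>.dfs.core.windows.net/raw/events/")
df.show(5)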

Asked in Infobell IT Solutions

Q. 1) What optimization techniques have you used while working with Spark and Hive? 2) What is the difference between partitioning and bucketing? 3) How do you add a column to a DataFrame? 4) What is the difference between cache and persist?
Answers to questions related to Spark, Hive, and Data Frames
Optimization techniques in Spark and Hive include partitioning, bucketing, and caching
Partitioning divides data into smaller, more manageable chunks while bucketing groups data based on a specific column
Adding a column to a DataFrame can be done using the withColumn() method
cache() stores data using the default storage level, while persist() lets you choose a storage level such as memory, disk, or both (see the sketch below)
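A short sketch of points 3 and 4, assuming PySpark; the column names are made up:

from pyspark.sql import SparkSession, functions as F
from pyspark import StorageLevel

spark = SparkSession.builder.appName("df-basics").getOrCreate()
df = spark.createDataFrame([(1, 100), (2, 250)], ["id", "amount"])

# 3) Add a derived column with withColumn
with_tax = df.withColumn("amount_with_tax", F.col("amount") * 1.18)

# 4) cache() uses the default storage level; persist() lets you pick one explicitly
with_tax.cache()
df.persist(StorageLevel.DISK_ONLY)
with_tax.show()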

Asked in Paytm

Q. 1) What is the internal mechanism in Spark? 2) Explain the Tungsten project in Spark. 3) SQL problem: find the last two transactions belonging to a particular retailer.
Questions related to Spark internals, Tungsten project, and SQL problem for retail transactions.
Spark's internal mechanism includes components like Spark Core, Spark SQL, Spark Streaming, and MLlib.
Tungsten project in Spark aims to improve the performance of Spark by optimizing memory usage and CPU utilization.
To solve the SQL problem, we can use a query to filter transactions for a particular retail and then use the 'ORDER BY' clause to sort them by date and time. We can the...read more
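One way to express the "last two transactions per retailer" query, shown here via spark.sql against an illustrative transactions table (path and column names are assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("last-two-txns").getOrCreate()
spark.read.parquet("/data/transactions").createOrReplaceTempView("transactions")  # hypothetical source

last_two = spark.sql("""
    SELECT retailer_id, txn_id, txn_ts, amount
    FROM (
        SELECT t.*,
               ROW_NUMBER() OVER (PARTITION BY retailer_id ORDER BY txn_ts DESC) AS rn
        FROM transactions t
    ) ranked
    WHERE rn <= 2
""")
last_two.show()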

Asked in TCS

Q. What is a view in SQL, and what is dense rank?
A view is a virtual table created from a SQL query. DENSE_RANK() assigns consecutive ranks to rows with no gaps, giving tied values the same rank.
A view is a saved SQL query that can be used as a table
DENSE_RANK() assigns consecutive ranks with no gaps between them; rows with equal values in the ordering columns receive the same rank
Dense rank is used to rank rows based on a specific column or set of columns
Example: SELECT * FROM my_view WHERE column_name = 'value'
Example: SELECT column_name, DENSE_RANK() OVER (ORDER BY column_name) FROM my_table

Asked in PwC

Q. What are the concepts of coalesce and repartition in data processing?
Coalesce and repartition are concepts used in data processing to control the number of partitions in a dataset.
Coalesce is used to reduce the number of partitions in a dataset without shuffling the data, which can improve performance.
Repartition is used to increase or decrease the number of partitions in a dataset by shuffling the data across the cluster.
Coalesce is preferred over repartition when reducing partitions to avoid unnecessary shuffling of data.
Repartition is usefu...read more
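A minimal PySpark sketch of the difference; the partition counts are arbitrary:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitions").getOrCreate()
df = spark.range(1_000_000)

fewer = df.coalesce(4)      # narrow operation: merges partitions, avoids a full shuffle
more = df.repartition(64)   # full shuffle: redistributes data evenly across partitions

print(fewer.rdd.getNumPartitions(), more.rdd.getNumPartitions())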

Asked in ConcertAI

Q. What are the different AWS Data Analytics services used in your project and explanation for why each service was used? What are the alternate services available and why they were not used in the project? Questi...
AWS Data Analytics services used in the project, alternate services, and RDBMS concepts
AWS Data Analytics services used: Amazon Redshift for data warehousing, Amazon EMR for big data processing, Amazon Athena for interactive querying
Explanation for usage: Redshift for storing and analyzing large datasets, EMR for processing and analyzing big data, Athena for ad-hoc querying
Alternate services not used: Amazon RDS for relational database management, Amazon Kinesis for real-time dat...read more

Asked in TransOrg Analytics

Q. Design a business case for using self-join, excluding hierarchical use cases such as teacher-student or employee-manager relationships.
Using self join to analyze customer behavior in an e-commerce platform.
Identifying patterns in customer purchase history
Analyzing customer preferences based on past purchases
Segmenting customers based on their buying behavior
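A hedged SQL sketch of the idea, run through spark.sql against an illustrative orders table, finding pairs of products bought together by the same customer (path and columns are assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("self-join").getOrCreate()
spark.read.parquet("/data/orders").createOrReplaceTempView("orders")  # hypothetical source

pairs = spark.sql("""
    SELECT a.product_id AS product_a, b.product_id AS product_b, COUNT(*) AS times_bought_together
    FROM orders a
    JOIN orders b
      ON a.customer_id = b.customer_id
     AND a.product_id < b.product_id
    GROUP BY a.product_id, b.product_id
    ORDER BY times_bought_together DESC
""")
pairs.show(10)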

Asked in TransOrg Analytics

Q. What is normalization in SQL, and can you explain 1NF, 2NF, and 3NF?
Normalization in SQL is the process of organizing data in a database to reduce redundancy and improve data integrity.
1NF (First Normal Form) - Each column in a table must contain atomic values, and there should be no repeating groups.
2NF (Second Normal Form) - Table should be in 1NF and all non-key attributes must be fully functionally dependent on the primary key.
3NF (Third Normal Form) - Table should be in 2NF and there should be no transitive dependencies between non-key attribu...read more

Asked in Cognizant

Q. What is the difference between supervised and unsupervised learning?
Supervised learning uses labeled data to train the model, while unsupervised learning uses unlabeled data.
Supervised learning requires a target variable to be predicted, while unsupervised learning does not.
In supervised learning, the model learns from labeled training data, whereas in unsupervised learning, the model finds patterns in unlabeled data.
Examples of supervised learning include regression and classification tasks, while clustering is a common unsupervised learning...read more

Asked in Perficient

Q. What tools have you used for data engineering?
Tools used for data engineering include ETL tools, programming languages, databases, and cloud platforms.
ETL tools like Apache NiFi, Talend, and Informatica are used for data extraction, transformation, and loading.
Programming languages like Python, Java, and Scala are used for data processing and analysis.
Databases like MySQL, PostgreSQL, and MongoDB are used for storing and managing data.
Cloud platforms like AWS, Azure, and Google Cloud provide scalable infrastructure for d...read more

Asked in TCS

Q. Write a function to check if a number is an Armstrong Number.
Function to check if a number is an Armstrong Number
An Armstrong Number is a number that is equal to the sum of its own digits raised to the power of the number of digits
To check if a number is an Armstrong Number, we need to calculate the sum of each digit raised to the power of the number of digits
If the sum is equal to the original number, then it is an Armstrong Number
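A minimal Python function following the description above:

def is_armstrong(n: int) -> bool:
    digits = str(n)
    power = len(digits)
    # Sum of each digit raised to the power of the number of digits
    return n == sum(int(d) ** power for d in digits)

print(is_armstrong(153))  # True: 1^3 + 5^3 + 3^3 = 153
print(is_armstrong(100))  # False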

Asked in PwC

Q. What is AFD? Build a dynamic pipeline; Spark architecture; SQL data flow.
AFD is not a commonly used term in data engineering. Can you provide more context?

Asked in IBM

Q. What is the difference between cache and persistent storage?
Cache is temporary storage used to speed up access to frequently accessed data. Persistent storage is permanent storage used to store data even after power loss.
Cache is faster but smaller than persistent storage
Cache is volatile and data is lost when power is lost
Persistent storage is non-volatile and data is retained even after power loss
Examples of cache include CPU cache, browser cache, and CDN cache
Examples of persistent storage include hard disk drives, solid-state driv...read more

Asked in Tiger Analytics

Q. Explain databricks dlt, and when will you use batch vs streaming?
Databricks DLT (Delta Live Tables) is a declarative framework for building reliable batch and streaming data pipelines.
DLT pipelines are built on Delta Lake tables, which bring ACID transactions to Apache Spark and big data workloads.
Batch processing is used when data is collected over a period of time and processed in large chunks, while streaming processing is used for real-time data processing.
Use batch processing for historical data analysis, ETL jobs, and periodic reporting. Use streaming proce...read more

Asked in HSBC Group

Q. What is the imputer function in PySpark?
Imputer function in PySpark is used to replace missing values in a DataFrame.
Imputer is a transformer in PySpark ML library.
It replaces missing values in a DataFrame with either mean, median, or mode of the column.
It works on numeric columns; categorical columns need a different approach, such as fillna() with a constant or the most frequent value.
Example: imputer = Imputer(inputCols=['col1', 'col2'], outputCols=['col1_imputed', 'col2_imputed'], strategy='mean')
Example: imputed_df = imputer.fit(df).transform(df)
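Putting the example lines above into a small self-contained sketch (column names and data are illustrative):

from pyspark.sql import SparkSession
from pyspark.ml.feature import Imputer

spark = SparkSession.builder.appName("imputer").getOrCreate()
df = spark.createDataFrame([(1.0, 2.0), (2.0, None), (None, 6.0)], ["col1", "col2"])

imputer = Imputer(inputCols=["col1", "col2"],
                  outputCols=["col1_imputed", "col2_imputed"],
                  strategy="mean")
imputer.fit(df).transform(df).show()  # nulls replaced by each column's mean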

Asked in Accenture

Q. Rate your proficiency with PySpark, Python, and SQL on a scale of 1 to 5.
I would rate myself 4 in PySpark, 5 in Python, and 4 in SQL.
Strong proficiency in Python programming language
Experience in working with Pyspark for big data processing
Proficient in writing complex SQL queries for data manipulation
Familiarity with optimizing queries for performance
Hands-on experience in data engineering projects

Asked in Hexaware Technologies

Q. What type of schemas did you use for your project? (Star schema, Snowflake Schema)
I have experience working with both Star and Snowflake schemas in my projects.
Star schema is a denormalized schema where one central fact table is connected to multiple dimension tables.
Snowflake schema is a normalized schema where dimension tables are further normalized into sub-dimension tables.
Used Star schema for simpler, smaller datasets where performance is a priority.
Used Snowflake schema for complex, larger datasets where data integrity and scalability are important.

Asked in Nagarro

Q. What is the difference between OLAP (Online Analytical Processing) and OLTP (Online Transaction Processing)?
OLAP is for analysis and reporting, while OLTP is for managing day-to-day transactional data.
OLAP systems are optimized for read-heavy operations, allowing complex queries and aggregations.
OLTP systems are designed for high transaction throughput, focusing on insert, update, and delete operations.
OLAP databases often use star or snowflake schemas for data organization, while OLTP databases use normalized schemas.
Examples of OLAP tools include Microsoft SQL Server Analysis Ser...read more

Asked in Infosys

Q. What is the process of performing ETL (Extract, Transform, Load), and can you provide an example?
ETL is a data integration process that involves extracting data, transforming it for analysis, and loading it into a target system.
Extract: Gather data from various sources like databases, APIs, or flat files. Example: Pulling customer data from a CRM system.
Transform: Clean and format the data to meet business requirements. Example: Converting date formats or aggregating sales data.
Load: Insert the transformed data into a target database or data warehouse. Example: Loading t...read more
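An end-to-end mini-example of the three steps, sketched in PySpark; the paths and column names are made up:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("mini-etl").getOrCreate()

# Extract: read raw CSV exported from a source system
raw = spark.read.option("header", True).csv("/landing/crm_customers.csv")

# Transform: normalize a date format and aggregate
clean = (raw.withColumn("signup_date", F.to_date("signup_date", "dd/MM/yyyy"))
            .groupBy("country")
            .agg(F.count("*").alias("customers")))

# Load: write the result to the warehouse layer
clean.write.mode("overwrite").parquet("/warehouse/customers_by_country/")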

Asked in LTIMindtree

Q. Calculate second highest salary using SQL as well as pyspark.
Calculate second highest salary using SQL and pyspark
Use SQL query with ORDER BY and LIMIT to get the second highest salary
In pyspark, use orderBy() and take() functions to achieve the same result
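Hedged sketches of both approaches, assuming an employees table/DataFrame with a salary column:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("second-highest").getOrCreate()
emp = spark.createDataFrame([("a", 100), ("b", 300), ("c", 200)], ["name", "salary"])
emp.createOrReplaceTempView("employees")

# SQL: the highest salary below the overall maximum
spark.sql("""
    SELECT MAX(salary) AS second_highest
    FROM employees
    WHERE salary < (SELECT MAX(salary) FROM employees)
""").show()

# PySpark: order distinct salaries descending and take the second value
second = emp.select("salary").distinct().orderBy(F.col("salary").desc()).take(2)[1][0]
print(second)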

Asked in Prudent Tech IT Solutions

Q. How do you optimize SQL queries?
Optimizing SQL queries involves using indexes, avoiding unnecessary joins, and optimizing the query structure.
Use indexes on columns frequently used in WHERE clauses
Avoid using SELECT * and only retrieve necessary columns
Optimize joins by using INNER JOIN instead of OUTER JOIN when possible
Use EXPLAIN to analyze query performance and make necessary adjustments
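A small illustration of the EXPLAIN point using PySpark's explain(); the table and columns are made up:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("explain-demo").getOrCreate()
orders = spark.createDataFrame([(1, 10, 500), (2, 20, 150)], ["order_id", "customer_id", "amount"])
orders.createOrReplaceTempView("orders")

# Inspect the physical plan before tuning the query
spark.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    WHERE amount > 100
    GROUP BY customer_id
""").explain()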