Data Engineer

1000+ Data Engineer Interview Questions and Answers

Updated 12 Jul 2025

Asked in IBM

Q. What is the difference between partitioning and bucketing?

Ans.

Partitioning is dividing data into smaller chunks based on a column value. Bucketing is dividing data into a fixed number of buckets based on a hash of a column.

  • Partitioning is used for organizing data for efficient querying and processing.

  • Bucketing spreads rows evenly across a fixed number of buckets, which helps with joins, sampling, and avoiding skew.

  • Partitioning is done based on a column value, such as date or region.

  • Bucketing is done by hashing the bucketing column and taking the hash modulo the number of buckets.

  • Partitioning can improve query performance by pruning irrelevant partitions at read time; a sketch of both follows.
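A minimal PySpark sketch, assuming an existing DataFrame df with region and user_id columns (paths and table names are illustrative):

# Partitioning: one directory per distinct region value, enabling partition pruning
df.write.partitionBy("region").parquet("/data/events_by_region")

# Bucketing: rows hashed on user_id into 16 fixed buckets (requires saveAsTable)
df.write.bucketBy(16, "user_id").sortBy("user_id").saveAsTable("events_bucketed")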


Q. What optimization techniques have you used in your project?

Ans.

I have used techniques like indexing, query optimization, and parallel processing in my projects.

  • Indexing: Used to improve the speed of data retrieval by creating indexes on columns frequently used in queries.

  • Query optimization: Rewriting queries to improve efficiency and reduce execution time.

  • Parallel processing: Distributing tasks across multiple processors to speed up data processing.

  • Caching: Storing frequently accessed data in memory to reduce the need for repeated retrieval; a short caching sketch follows.
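A hedged PySpark sketch of the caching point (df is a hypothetical DataFrame reused by several queries):

df.cache()                                       # mark for in-memory reuse
df.count()                                       # first action materializes the cache
high_value = df.filter("amount > 100").count()   # served from cache
by_region = df.groupBy("region").count()         # served from cache
df.unpersist()                                   # release memory when done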

Asked in Infovision

Q. What is Integration Runtime (IR)? What are the types of IR?

Ans.

Integration Runtime (IR) is a compute infrastructure that provides data integration capabilities across different network environments.

  • IR is used in Azure Data Factory to provide data integration capabilities

  • There are three types of IR: Azure, Self-hosted, and Azure-SSIS

  • Azure IR is fully managed by Microsoft and is used for data movement in the cloud

  • Self-hosted IR allows data movement between on-premises and cloud data stores

  • Azure-SSIS IR is used for running SQL Server Integration Services (SSIS) packages in the cloud

Q. How do you find duplicate values in SQL? What is the use of a CTE, and what are indexes, views, and triggers in SQL?

Ans.

Finding duplicates in SQL involves using GROUP BY and CTEs, while indexes, views, and triggers enhance database performance and functionality.

  • Use GROUP BY to find duplicates: SELECT column_name, COUNT(*) FROM table_name GROUP BY column_name HAVING COUNT(*) > 1.

  • Common Table Expressions (CTEs) simplify complex queries and improve readability: WITH cte AS (SELECT ...) SELECT * FROM cte.

  • Indexes speed up data retrieval: CREATE INDEX index_name ON table_name(column_name).

  • Views are virtual tables defined by a query (CREATE VIEW view_name AS SELECT ...); triggers are stored routines that fire automatically on INSERT, UPDATE, or DELETE events. A combined sketch follows.
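A sketch tying duplicates and CTEs together via Spark SQL (table and column names are hypothetical; assumes a SparkSession named spark):

dupes = spark.sql("""
    WITH counted AS (
        SELECT email, COUNT(*) AS cnt
        FROM customers
        GROUP BY email
    )
    SELECT email, cnt FROM counted WHERE cnt > 1
""")
dupes.show()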


Asked in EPAM Systems

Q. Write code to print the reverse of a sentence word by word.

Ans.

This code reverses the order of words in a given sentence, providing a clear output of the reversed sentence.

  • Split the Sentence: Use a method to split the sentence into an array of words. Example: 'Hello World' becomes ['Hello', 'World'].

  • Reverse the Array: Utilize an array method to reverse the order of the words. Example: ['Hello', 'World'] becomes ['World', 'Hello'].

  • Join the Words: Combine the reversed array back into a string. Example: ['World', 'Hello'] becomes 'World Hello'. A minimal implementation follows.
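A minimal Python version of these three steps:

def reverse_words(sentence):
    # split into words, reverse their order, and join back with spaces
    return " ".join(reversed(sentence.split()))

print(reverse_words("Hello World"))  # World Hello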

Asked in HSBC Group

Q. How do you handle null values in PySpark?

Ans.

Null values in PySpark are handled using functions such as dropna(), fillna(), and replace().

  • dropna() function is used to drop rows or columns with null values

  • fillna() function is used to fill null values with a specified value or method

  • replace() function is used to replace null values with a specified value

  • coalesce() function is used to replace null values with the first non-null value in a list of columns
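A short PySpark sketch of these calls (column names are hypothetical; df is an existing DataFrame):

from pyspark.sql import functions as F

df.dropna(subset=["price"])                        # drop rows where price is null
df.fillna({"price": 0.0, "category": "unknown"})   # per-column default values
df.na.replace("N/A", None, "category")             # turn sentinel strings into nulls
df.withColumn("price_final", F.coalesce("price", "list_price"))  # first non-null wins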


Asked in Linedata

Q. How would you monitor an overnight data load job in Snowflake?

Ans.

Monitoring an overnight data load job in Snowflake combines alerts, log checks, resource monitoring, and query history.

  • Set up alerts and notifications for job completion or failure

  • Check job logs for any errors or issues

  • Monitor resource usage during the data load process

  • Use Snowflake's query history to track job progress

  • Implement automated retries in case of failures
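A hedged sketch of the query-history check using the Snowflake Python connector (connection details and the COPY target table are hypothetical):

import snowflake.connector

conn = snowflake.connector.connect(user="etl_monitor", password="...", account="my_account")
cur = conn.cursor()
# QUERY_HISTORY returns recent queries; filter for the overnight load statement
cur.execute("""
    SELECT query_text, execution_status, error_message, total_elapsed_time
    FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY())
    WHERE query_text ILIKE '%COPY INTO nightly_sales%'
    ORDER BY start_time DESC
""")
for row in cur.fetchall():
    print(row)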

Asked in DELTA

Q. How can you achieve AWS cross-account sharing?

Ans.

AWS cross-account sharing can be achieved by using IAM roles and policies.

  • Create an IAM role in the account that owns the resources

  • Define a trust policy on the role allowing the other account to assume it

  • Attach a policy to the role granting the necessary permissions

  • In the consuming account, give principals sts:AssumeRole permission on the shared role's ARN

  • Use the AWS CLI, console, or SDK to assume the role and access the shared resources; a boto3 sketch follows
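A hedged boto3 sketch of the consuming side (role ARN, session name, and bucket are hypothetical):

import boto3

sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn="arn:aws:iam::111122223333:role/shared-data-role",
    RoleSessionName="cross-account-read",
)["Credentials"]

# Temporary credentials scoped to the shared role's permissions
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
print(s3.list_objects_v2(Bucket="shared-bucket").get("KeyCount"))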


Asked in Capgemini

Q. How do you remove duplicates in a DataFrame using PySpark?

Ans.

Use the dropDuplicates() function in PySpark to remove duplicates from a DataFrame.

  • Call dropDuplicates() on the DataFrame to remove duplicates based on all columns.

  • Pass a subset of columns to remove duplicates based on specific columns.

  • Use the distinct() function to remove duplicates and keep only distinct rows; see the sketch below.
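A short sketch of the three options (column names hypothetical; df is an existing DataFrame):

df.dropDuplicates()                 # dedupe considering all columns
df.dropDuplicates(["customer_id"])  # dedupe on a subset of columns
df.distinct()                       # equivalent to dropDuplicates() on all columns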

Asked in HCLTech

Q. How do you connect to Azure Data Lake Storage Gen2 from Databricks?

Ans.

To connect to ADLS Gen2 from Databricks, use the ABFS driver with an account key, service principal, or managed identity.

  • ADLS Gen2 is accessed through the abfss:// URI scheme and the dfs.core.windows.net endpoint (the ABFS driver), rather than the legacy Blob endpoint

  • Provide the storage account name and key (or service principal credentials) for authentication

  • Reference paths as abfss://<container>@<storage-account>.dfs.core.windows.net/<path>

  • Example: spark.conf.set('fs.azure.account.key.<storage-account>.dfs.core.windows.net', '<access-key>')
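A hedged sketch of account-key authentication with the ABFS driver (storage account, container, and key are placeholders):

spark.conf.set(
    "fs.azure.account.key.mystorageacct.dfs.core.windows.net",
    "<access-key>",
)
df = spark.read.parquet("abfss://mycontainer@mystorageacct.dfs.core.windows.net/raw/events/")
df.show(5)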


Q. 1) Optimization techniques used while working on Spark and Hive. 2) Difference between partitioning and bucketing. 3) How to add a column to a DataFrame. 4) Difference between cache and persist.

Ans.

Answers to questions related to Spark, Hive, and Data Frames

  • Optimization techniques in Spark and Hive include partitioning, bucketing, and caching

  • Partitioning divides data into smaller, more manageable chunks while bucketing groups data based on a specific column

  • Adding a column to a Data Frame can be done using the 'withColumn' method

  • cache() stores data at the default storage level (memory, spilling to disk), while persist() lets you choose an explicit storage level such as disk-only; a sketch of points 3 and 4 follows
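A short PySpark illustration of points 3 and 4 (column names hypothetical; df is an existing DataFrame):

from pyspark import StorageLevel
from pyspark.sql import functions as F

df2 = df.withColumn("amount_usd", F.col("amount") * F.lit(0.012))  # add a derived column
df2.cache()                              # shorthand for persist() at the default level
df3 = df.withColumn("flag", F.lit(1))
df3.persist(StorageLevel.DISK_ONLY)      # persist() lets you choose the storage level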

Asked in Paytm

Q. 1) What is the internal mechanism of Spark? 2) Explain the Tungsten project in Spark. 3) SQL problem: check whether the last two transactions belong to a particular retailer.

Ans.

Questions related to Spark internals, Tungsten project, and SQL problem for retail transactions.

  • Spark's internal mechanism includes components like Spark Core, Spark SQL, Spark Streaming, and MLlib.

  • Tungsten project in Spark aims to improve the performance of Spark by optimizing memory usage and CPU utilization.

  • To solve the SQL problem, we can filter transactions for a particular retailer, sort them by date and time with ORDER BY, and then take the two most recent rows, e.g. with ROW_NUMBER(), as sketched below.
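A hedged Spark SQL sketch for part 3 (table, columns, and retailer id are hypothetical):

last_two = spark.sql("""
    SELECT retailer_id, txn_id, txn_time
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY retailer_id ORDER BY txn_time DESC) AS rn
        FROM transactions
    )
    WHERE rn <= 2 AND retailer_id = 'R123'
""")
last_two.show()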

Asked in TCS

Q. What is a view in SQL, and what are RANK and DENSE_RANK?

Ans.

A view is a virtual table created from a SQL query. DENSE_RANK assigns ranks to rows, giving tied rows the same rank with no gaps in the sequence.

  • A view is a saved SQL query that can be queried like a table

  • DENSE_RANK() gives tied rows the same rank, and the next distinct value receives the next consecutive rank (no gaps, unlike RANK(), which skips ranks after ties)

  • Dense rank is used to rank rows based on a specific column or set of columns

  • Example: SELECT * FROM my_view WHERE column_name = 'value'

  • Example: SELECT column_name, DENSE_RANK() OVER (ORDER BY column_name) FROM my_table

Asked in PwC

Q. What are the concepts of coalesce and repartition in data processing?

Ans.

Coalesce and repartition are concepts used in data processing to control the number of partitions in a dataset.

  • Coalesce is used to reduce the number of partitions in a dataset without shuffling the data, which can improve performance.

  • Repartition is used to increase or decrease the number of partitions in a dataset by shuffling the data across the cluster.

  • Coalesce is preferred over repartition when reducing partitions to avoid unnecessary shuffling of data.

  • Repartition is useful when increasing parallelism or redistributing skewed data evenly across the cluster; a short sketch follows.
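A minimal sketch of both calls (partition counts and key are illustrative; df is an existing DataFrame):

df_small = df.coalesce(10)                    # shrink to 10 partitions, no full shuffle
df_even = df.repartition(200, "customer_id")  # full shuffle into 200 partitions by key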

Asked in ConcertAI

Q. What are the different AWS Data Analytics services used in your project, and why was each service used? What are the alternate services available, and why were they not used in the project? ...

Ans.

AWS Data Analytics services used in project, alternate services, and RDBMS concepts

  • AWS Data Analytics services used: Amazon Redshift for data warehousing, Amazon EMR for big data processing, Amazon Athena for interactive querying

  • Explanation for usage: Redshift for storing and analyzing large datasets, EMR for processing and analyzing big data, Athena for ad-hoc querying

  • Alternate services not used: Amazon RDS for relational database management, Amazon Kinesis for real-time data streaming


Q. Design a business case for using self-join, excluding hierarchical use cases such as teacher-student or employee-manager relationships.

Ans.

Using self join to analyze customer behavior in an e-commerce platform.

  • Identifying patterns in customer purchase history

  • Analyzing customer preferences based on past purchases

  • Segmenting customers based on their buying behavior
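A hedged Spark SQL sketch of such a self-join (schema is hypothetical): pairing each order with the same customer's later orders to surface purchase sequences.

pairs = spark.sql("""
    SELECT a.customer_id,
           a.product_id AS first_product,
           b.product_id AS later_product
    FROM orders a
    JOIN orders b
      ON a.customer_id = b.customer_id
     AND a.order_date < b.order_date
""")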


Q. What is normalization in SQL, and can you explain 1NF, 2NF, and 3NF?

Ans.

Normalization in SQL is the process of organizing data in a database to reduce redundancy and improve data integrity.

  • 1NF (First Normal Form) - Each column in a table must contain atomic values, and there should be no repeating groups.

  • 2NF (Second Normal Form) - Table should be in 1NF and all non-key attributes must be fully functionally dependent on the primary key.

  • 3NF (Third Normal Form) - Table should be in 2NF and there should be no transitive dependencies between non-key attributes.

Asked in Cognizant

Q. What is the difference between supervised and unsupervised learning?

Ans.

Supervised learning uses labeled data to train the model, while unsupervised learning uses unlabeled data.

  • Supervised learning requires a target variable to be predicted, while unsupervised learning does not.

  • In supervised learning, the model learns from labeled training data, whereas in unsupervised learning, the model finds patterns in unlabeled data.

  • Examples of supervised learning include regression and classification tasks, while clustering is a common unsupervised learning task.

Asked in Perficient

Q. What tools have you used for data engineering?

Ans.

Tools used for data engineering include ETL tools, programming languages, databases, and cloud platforms.

  • ETL tools like Apache NiFi, Talend, and Informatica are used for data extraction, transformation, and loading.

  • Programming languages like Python, Java, and Scala are used for data processing and analysis.

  • Databases like MySQL, PostgreSQL, and MongoDB are used for storing and managing data.

  • Cloud platforms like AWS, Azure, and Google Cloud provide scalable infrastructure for data pipelines and storage.

Asked in TCS

Q. Write a function to check if a number is an Armstrong Number.

Ans.

Function to check if a number is an Armstrong Number

  • An Armstrong Number is a number that is equal to the sum of its own digits raised to the power of the number of digits

  • To check if a number is an Armstrong Number, we need to calculate the sum of each digit raised to the power of the number of digits

  • If the sum is equal to the original number, then it is an Armstrong Number
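A minimal Python implementation (assumes a non-negative integer input):

def is_armstrong(n):
    digits = str(n)
    power = len(digits)
    # sum of each digit raised to the power of the number of digits
    return n == sum(int(d) ** power for d in digits)

print(is_armstrong(153))  # True: 1**3 + 5**3 + 3**3 == 153
print(is_armstrong(154))  # False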

Asked in PwC

Q. What is ADF? How do you build a dynamic pipeline? Explain Spark architecture and SQL data flows.

Ans.

'AFD' is most likely a typo for ADF (Azure Data Factory), a cloud data integration service for building dynamic pipelines.

  • ADF pipelines are made dynamic with parameters, variables, and expressions

  • ADF Mapping Data Flows execute on managed Spark clusters, so Spark architecture (driver, executors, partitioned data) underpins their execution

Asked in IBM

Q. What is the difference between cache and persistent storage?

Ans.

Cache is temporary storage used to speed up access to frequently accessed data. Persistent storage is permanent storage that retains data even after power loss.

  • Cache is faster but smaller than persistent storage

  • Cache is volatile and its contents are lost when power is lost

  • Persistent storage is non-volatile and data is retained even after power loss

  • Examples of cache include CPU cache, browser cache, and CDN cache

  • Examples of persistent storage include hard disk drives, solid-state drives, and object storage


Q. Explain Databricks DLT. When would you use batch vs. streaming?

Ans.

Databricks DLT (Delta Live Tables) is a declarative framework for building reliable batch and streaming pipelines on Delta Lake.

  • DLT lets you declare tables and data quality expectations; it manages orchestration, retries, and monitoring, with Delta Lake providing ACID transactions on top of Apache Spark.

  • Batch processing is used when data is collected over a period of time and processed in large chunks, while streaming processing is used for real-time data processing.

  • Use batch processing for historical data analysis, ETL jobs, and periodic reporting. Use streaming processing for real-time use cases such as dashboards, alerting, and event-driven pipelines; a sketch contrasting the two follows.
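A hedged PySpark sketch contrasting the two read modes (paths are hypothetical; assumes a SparkSession named spark):

# Batch: read whatever is in the path now, process once
batch_df = spark.read.json("/data/events/")

# Streaming: continuously pick up new files as they arrive
stream_df = (spark.readStream
             .schema(batch_df.schema)  # streaming file sources need an explicit schema
             .json("/data/events/"))
query = stream_df.writeStream.format("console").start()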

Asked in HSBC Group

Q. What is the Imputer function in PySpark?

Ans.

The Imputer in PySpark is used to replace missing values in a DataFrame.

  • Imputer is an estimator in the PySpark ML library (pyspark.ml.feature); its fitted model transforms the DataFrame

  • It replaces missing values in a DataFrame with the mean, median, or mode of the column

  • It works on numeric input columns

  • Example: imputer = Imputer(inputCols=['col1', 'col2'], outputCols=['col1_imputed', 'col2_imputed'], strategy='mean')

  • Example: imputed_df = imputer.fit(df).transform(df)
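Putting the two example lines together as a runnable sketch (column names hypothetical; df is an existing DataFrame with numeric col1 and col2):

from pyspark.ml.feature import Imputer

imputer = Imputer(
    inputCols=["col1", "col2"],
    outputCols=["col1_imputed", "col2_imputed"],
    strategy="mean",
)
imputed_df = imputer.fit(df).transform(df)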

Asked in Accenture

Q. Rate your proficiency with PySpark, Python, and SQL on a scale of 1 to 5.

Ans.

I would rate myself 4 in PySpark, 5 in Python, and 4 in SQL.

  • Strong proficiency in the Python programming language

  • Experience working with PySpark for big data processing

  • Proficient in writing complex SQL queries for data manipulation

  • Familiarity with optimizing queries for performance

  • Hands-on experience in data engineering projects


Q. What type of schemas did you use for your project? (Star schema, Snowflake schema)

Ans.

I have experience working with both Star and Snowflake schemas in my projects.

  • Star schema is a denormalized schema where one central fact table is connected to multiple dimension tables.

  • Snowflake schema is a normalized schema where dimension tables are further normalized into sub-dimension tables.

  • Used Star schema for simpler, smaller datasets where query performance is a priority.

  • Used Snowflake schema for complex, larger datasets where data integrity and scalability are important.

Asked in Nagarro

Q. What is the difference between OLAP (Online Analytical Processing) and OLTP (Online Transaction Processing)?

Ans.

OLAP is for analysis and reporting, while OLTP is for managing day-to-day transactional data.

  • OLAP systems are optimized for read-heavy operations, allowing complex queries and aggregations.

  • OLTP systems are designed for high transaction throughput, focusing on insert, update, and delete operations.

  • OLAP databases often use star or snowflake schemas for data organization, while OLTP databases use normalized schemas.

  • Examples of OLAP tools include Microsoft SQL Server Analysis Services (SSAS).

Asked in Infosys

Q. What is the process of performing ETL (Extract, Transform, Load), and can you provide an example?

Ans.

ETL is a data integration process that involves extracting data, transforming it for analysis, and loading it into a target system.

  • Extract: Gather data from various sources like databases, APIs, or flat files. Example: Pulling customer data from a CRM system.

  • Transform: Clean and format the data to meet business requirements. Example: Converting date formats or aggregating sales data.

  • Load: Insert the transformed data into a target database or data warehouse. Example: Loading the cleaned customer data into a warehouse table; a toy end-to-end sketch follows.
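A toy end-to-end PySpark sketch mirroring the three steps (paths and columns are hypothetical; assumes a SparkSession named spark):

from pyspark.sql import functions as F

# Extract: read the raw CSV exported from the CRM
raw = spark.read.csv("/landing/customers.csv", header=True)

# Transform: drop rows missing the key and normalize the date format
clean = (raw.dropna(subset=["customer_id"])
            .withColumn("signup_date", F.to_date("signup_date", "dd/MM/yyyy")))

# Load: write into the warehouse zone
clean.write.mode("overwrite").parquet("/warehouse/customers/")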

Asked in LTIMindtree

Q. Calculate the second highest salary using SQL as well as PySpark.

Ans.

Calculate the second highest salary using SQL and PySpark.

  • In SQL, select the maximum salary below the overall maximum, or use DENSE_RANK()/LIMIT with OFFSET

  • In PySpark, use distinct(), orderBy(), and take() to achieve the same result; both approaches are sketched below
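A hedged sketch of both approaches (table and column names hypothetical; assumes a SparkSession named spark and a DataFrame df):

# SQL: the largest salary strictly below the overall maximum
second = spark.sql("""
    SELECT MAX(salary) AS second_highest
    FROM employees
    WHERE salary < (SELECT MAX(salary) FROM employees)
""")

# PySpark: distinct salaries, sorted descending, take the second
from pyspark.sql import functions as F
top_two = df.select("salary").distinct().orderBy(F.col("salary").desc()).take(2)
second_highest = top_two[1]["salary"] if len(top_two) == 2 else None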

Q. How do you optimize SQL queries?

Ans.

Optimizing SQL queries involves using indexes, avoiding unnecessary joins, and optimizing the query structure.

  • Use indexes on columns frequently used in WHERE clauses

  • Avoid using SELECT * and only retrieve necessary columns

  • Optimize joins by using INNER JOIN instead of OUTER JOIN when possible

  • Use EXPLAIN to analyze query performance and make necessary adjustments

