100+ Data Engineer Interview Questions and Answers for Freshers

Asked in LTIMindtree

Q. When a Spark job is submitted, what happens at the backend? Explain the flow.
When a Spark job is submitted, several steps are executed at the backend to process it.
The job is submitted to the Spark driver program.
The driver program communicates with the cluster manager to request resources.
The cluster manager allocates resources (CPU, memory) to the job.
The driver program creates a DAG (Directed Acyclic Graph) of the job's stages and tasks.
Tasks are then scheduled and executed on worker nodes in the cluster.
Intermediate results are stored in memory or spilled to disk, and the final results are returned to the driver program.
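A minimal PySpark sketch of this flow (a local SparkSession and a synthetic dataset are assumed); the action at the end is what triggers DAG creation, stage splitting, and task scheduling:

from pyspark.sql import SparkSession

# The driver program starts here and negotiates resources with the cluster manager
spark = SparkSession.builder.appName("job-flow-demo").master("local[*]").getOrCreate()

# Transformations are lazy: they only build up the DAG, nothing runs yet
df = spark.range(1_000_000).withColumnRenamed("id", "n")
squares = df.selectExpr("n", "n * n AS n_squared")

# The action below triggers scheduling: stages are created at shuffle boundaries
# and tasks are sent to executors on the worker nodes
total = squares.groupBy().sum("n_squared").collect()
print(total)

spark.stop()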

Asked in LTIMindtree

Q. For minimal latency, is standalone or client mode preferable?
Client mode is preferable for minimal latency because the driver communicates directly with the cluster.
Client mode allows direct communication with the cluster, reducing latency.
Standalone mode requires an additional layer of communication, increasing latency.
Client mode is preferred for real-time applications where low latency is crucial.

Asked in LTIMindtree

Q. How do you do performance optimization in Spark? How did you do it in your project?
Performance optimization in Spark involves tuning configurations, optimizing code, and utilizing caching.
Tune Spark configurations such as executor memory, number of executors, and shuffle partitions.
Optimize code by reducing unnecessary shuffles, using efficient transformations, and avoiding unnecessary data movements.
Utilize caching to store intermediate results in memory and avoid recomputation.
Example: In my project, I optimized Spark performance by increasing executor memory.
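A hedged illustration of these levers in PySpark (the configuration values and dataset paths are made up for the example):

from pyspark.sql import SparkSession

# Tune resources and shuffle parallelism up front (values are illustrative)
spark = (
    SparkSession.builder
    .appName("perf-tuning-demo")
    .config("spark.executor.memory", "4g")
    .config("spark.executor.instances", "4")
    .config("spark.sql.shuffle.partitions", "64")
    .getOrCreate()
)

orders = spark.read.parquet("/data/orders")   # hypothetical dataset path

# Cache an intermediate result that is reused several times to avoid recomputation
completed = orders.filter("status = 'COMPLETED'").cache()

# Prefer built-in column expressions over Python UDFs, and repartition only when
# it helps a downstream wide aggregation
daily_revenue = (
    completed.repartition("order_date")
             .groupBy("order_date")
             .sum("amount")
)
daily_revenue.write.mode("overwrite").parquet("/data/daily_revenue")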

Asked in Procore

Q. What is Data Lake? Difference between data lake and data warehouse
Data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed.
Data lake stores raw, unstructured data from various sources.
Data lake allows for storing large amounts of data without the need for a predefined schema.
Data lake is cost-effective for storing data that may not have a clear use case at the time of storage.
Data warehouse stores structured data for querying and analysis.
Data warehouse requires a predefined schema for data before loading (schema-on-write), while a data lake applies the schema on read.

Asked in Procore

Q. Why do we need a data warehouse? Why can't we store data in a normal transactional database?
Data warehouses are designed for analytical queries and reporting, while transactional databases are optimized for transactional processing.
Data warehouses are optimized for read-heavy workloads, allowing for complex queries and reporting.
Transactional databases are optimized for write-heavy workloads, ensuring data integrity and consistency.
Data warehouses often store historical data for analysis, while transactional databases focus on current data for operational purposes.

Asked in Infosys

Q. Describe Python dataframes, how they are used in projects, and when they are used.
Python dataframes are used to organize and manipulate data in a tabular format.
Dataframes are created using the pandas library in Python.
They allow for easy manipulation of data, such as filtering, sorting, and grouping.
Dataframes can be used in various projects, such as data analysis, machine learning, and data visualization.
Examples of using dataframes include analyzing sales data, predicting customer behavior, and visualizing stock market trends.
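A small pandas sketch of the operations mentioned above (the sales figures are invented):

import pandas as pd

# Build a DataFrame from an in-memory dict (in projects this usually comes from CSV, SQL, etc.)
sales = pd.DataFrame({
    "region": ["North", "South", "North", "East"],
    "product": ["A", "A", "B", "B"],
    "amount": [120.0, 90.5, 300.0, 150.0],
})

# Filtering, sorting, and grouping -- the typical day-to-day manipulations
north_only = sales[sales["region"] == "North"]
sorted_sales = sales.sort_values("amount", ascending=False)
revenue_by_region = sales.groupby("region")["amount"].sum().reset_index()

print(revenue_by_region)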

Asked in Procore

Q. Given a dataset containing product, date, and amount, calculate the revenue for 15 days and 30 days.
Calculate revenue over 15 and 30 days using SQL aggregation functions.
Use SUM() function to aggregate the 'amount' column.
Filter data using WHERE clause to limit the date range.
Example for 15 days: SELECT SUM(amount) FROM sales WHERE date >= CURRENT_DATE - INTERVAL '15 days';
Example for 30 days: SELECT SUM(amount) FROM sales WHERE date >= CURRENT_DATE - INTERVAL '30 days';
Ensure date format is consistent for accurate calculations.
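Since the question also mentions a product column, a per-product roll-up of both windows in one pass may be expected. A hedged PySpark sketch (the sales table and column names are assumed from the question):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("rolling-revenue").getOrCreate()
sales = spark.read.table("sales")   # assumed table with product, date, amount columns

today = F.current_date()
revenue = sales.groupBy("product").agg(
    # Conditional sums compute both windows in a single pass over the data
    F.sum(F.when(F.col("date") >= F.date_sub(today, 15), F.col("amount"))).alias("revenue_15d"),
    F.sum(F.when(F.col("date") >= F.date_sub(today, 30), F.col("amount"))).alias("revenue_30d"),
)
revenue.show()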

Asked in Tech Mahindra

Q. How do you remove duplicate rows in BigQuery? How do you find the month of a given date in BigQuery?
To remove duplicate rows from BigQuery, use the DISTINCT keyword. To find the month of a given date, use the EXTRACT function.
To remove duplicate rows, use SELECT DISTINCT * FROM table_name;
To find the month of a given date, use SELECT EXTRACT(MONTH FROM date_column) AS month FROM table_name;
Make sure to replace 'table_name' and 'date_column' with the appropriate values in your query.
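SELECT DISTINCT only removes rows that are identical across every column; to keep one row per business key, a ROW_NUMBER-based rewrite is a common alternative. A hedged sketch using the google-cloud-bigquery Python client (the credentials, dataset, table, and key columns are placeholders):

from google.cloud import bigquery

client = bigquery.Client()   # assumes default credentials and project

# Keep only the most recent row per id (id and event_ts are placeholder columns)
dedupe_sql = """
CREATE OR REPLACE TABLE my_dataset.events_dedup AS
SELECT * EXCEPT (rn)
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY event_ts DESC) AS rn
  FROM my_dataset.events
)
WHERE rn = 1
"""
client.query(dedupe_sql).result()

# Month of a given date/timestamp column
month_sql = "SELECT id, EXTRACT(MONTH FROM event_ts) AS month FROM my_dataset.events_dedup"
for row in client.query(month_sql).result():
    print(row.id, row.month)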

Asked in IntelliSurge Technologies

Q. How do you find duplicate values in SQL? What is the use of a CTE, and what are indexes, views, and triggers in SQL?
Finding duplicates in SQL involves using GROUP BY and CTEs, while indexes, views, and triggers enhance database performance and functionality.
Use GROUP BY to find duplicates: SELECT column_name, COUNT(*) FROM table_name GROUP BY column_name HAVING COUNT(*) > 1.
Common Table Expressions (CTEs) simplify complex queries and improve readability: WITH cte AS (SELECT ...) SELECT * FROM cte.
Indexes speed up data retrieval: CREATE INDEX index_name ON table_name(column_name).
Views are virtual tables defined by a stored query: CREATE VIEW view_name AS SELECT .... Triggers are procedures that run automatically in response to INSERT, UPDATE, or DELETE events on a table.

Asked in Perficient

Q. What tools have you used for data engineering?
Tools used for data engineering include ETL tools, programming languages, databases, and cloud platforms.
ETL tools like Apache NiFi, Talend, and Informatica are used for data extraction, transformation, and loading.
Programming languages like Python, Java, and Scala are used for data processing and analysis.
Databases like MySQL, PostgreSQL, and MongoDB are used for storing and managing data.
Cloud platforms like AWS, Azure, and Google Cloud provide scalable infrastructure for data storage and processing.

Asked in LTIMindtree

Q. Calculate the second highest salary using SQL as well as PySpark.
Calculate the second highest salary using SQL and PySpark.
Use a SQL query with DISTINCT, ORDER BY, and LIMIT/OFFSET (or a MAX subquery) to get the second highest salary.
In PySpark, use orderBy() and take(), or a dense_rank() window, to achieve the same result.
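A hedged sketch of both approaches, assuming a simple employees table with a salary column (the sample data is made up):

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("second-highest-salary").getOrCreate()
emp = spark.createDataFrame(
    [("a", 100), ("b", 300), ("c", 300), ("d", 200)], ["name", "salary"]
)
emp.createOrReplaceTempView("employees")

# SQL: the highest salary below the overall maximum is the second highest.
# (On engines that support OFFSET, SELECT DISTINCT salary ... ORDER BY salary DESC
#  LIMIT 1 OFFSET 1 is an equivalent formulation.)
spark.sql("""
    SELECT MAX(salary) AS second_highest
    FROM employees
    WHERE salary < (SELECT MAX(salary) FROM employees)
""").show()

# PySpark: rank salaries with a window and keep the rows at dense rank 2
w = Window.orderBy(F.col("salary").desc())
(emp.withColumn("rnk", F.dense_rank().over(w))
    .filter("rnk = 2")
    .select("salary")
    .distinct()
    .show())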

Asked in Prudent Tech IT Solutions

Q. How do you optimize SQL queries?
Optimizing SQL queries involves using indexes, avoiding unnecessary joins, and optimizing the query structure.
Use indexes on columns frequently used in WHERE clauses
Avoid using SELECT * and only retrieve necessary columns
Optimize joins by using INNER JOIN instead of OUTER JOIN when possible
Use EXPLAIN to analyze query performance and make necessary adjustments
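A small, hedged illustration using SQLite through Python's sqlite3 module (the orders table and index name are invented); SQLite's EXPLAIN QUERY PLAN stands in here for EXPLAIN in MySQL/PostgreSQL:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 50, i * 1.5) for i in range(1000)],
)

# Index the column used in the WHERE clause
conn.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")

# Select only the needed columns and check the plan instead of guessing
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT amount FROM orders WHERE customer_id = ?", (7,)
).fetchall()
print(plan)   # should show the query using idx_orders_customer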

Asked in LTIMindtree

Q. Write SQL and PySpark code for a given dataset scenario.
SQL and PySpark code examples for data manipulation and analysis.
Use SQL for structured queries: SELECT, JOIN, GROUP BY.
Example SQL: SELECT name, COUNT(*) FROM patients GROUP BY name;
Use PySpark for big data processing: DataFrame API, RDDs.
Example PySpark: df.groupBy('name').count().show();
Optimize queries with indexing in SQL and caching in PySpark.

Asked in Tech Mahindra

Q. Which operator in Composer is used to move data from GCS to BigQuery?
The GCSToBigQueryOperator (formerly GoogleCloudStorageToBigQueryOperator) is used in Composer to load data from GCS into BigQuery.
Cloud Composer is managed Apache Airflow, so this operator comes from Airflow's Google provider package.
This operator allows you to transfer data from Google Cloud Storage (GCS) to BigQuery.
You can specify the source and destination parameters in the operator to define the data transfer process.
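A hedged DAG sketch (the bucket, object prefix, and destination table are placeholders, and the import path assumes a recent apache-airflow-providers-google release):

from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG(
    dag_id="gcs_to_bq_demo",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load_events = GCSToBigQueryOperator(
        task_id="load_events",
        bucket="my-landing-bucket",                      # placeholder bucket
        source_objects=["events/{{ ds }}/*.csv"],        # templated GCS prefix
        destination_project_dataset_table="my_project.analytics.events",
        source_format="CSV",
        autodetect=True,                                 # let BigQuery infer the schema
        write_disposition="WRITE_APPEND",
    )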

Asked in TCS

Q. Do you have experience with AWS Glue? How would you use Glue for data migration?
Yes, I have experience in AWS Glue and can use it for data migration.
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data for analytics.
To use Glue for data migration, I would start by creating a Glue job that defines the source and target data sources, as well as any transformations needed.
I would then configure the job to run on a schedule or trigger it manually to migrate the data from the source to the target.
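As an illustration, a trimmed-down Glue job script for such a migration might look like the sketch below (the catalog database, table, dropped fields, and S3 path are placeholders):

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())

# Read the source table that a Glue crawler registered in the Data Catalog
source = glue_context.create_dynamic_frame.from_catalog(
    database="legacy_db", table_name="customers"
)

# Example transformation: drop columns that should not be migrated
trimmed = source.drop_fields(["ssn", "internal_notes"])

# Write the result to the target location in Parquet
glue_context.write_dynamic_frame.from_options(
    frame=trimmed,
    connection_type="s3",
    connection_options={"path": "s3://target-bucket/customers/"},
    format="parquet",
)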

Asked in LTIMindtree

Q. What are the two types of modes for Spark architecture?
The two types of modes for Spark architecture are standalone mode and cluster mode.
Standalone mode: Spark uses its own built-in cluster manager; a single-machine standalone setup is commonly used for development and testing.
Cluster mode: Spark runs on a cluster of machines managed by a cluster manager like YARN or Mesos for production workloads.

Asked in IntelliSurge Technologies

Q. What is a cursor, and how do you write a stored procedure in SQL Server Management Studio (SSMS)?
A cursor in SQL Server allows row-by-row processing of query results, and stored procedures encapsulate SQL code for reuse.
A cursor is a database object used to retrieve, manipulate, and navigate through a result set one row at a time.
Cursors can be declared using the syntax: DECLARE cursor_name CURSOR FOR SELECT statement.
To open a cursor, use: OPEN cursor_name; and to fetch rows, use: FETCH NEXT FROM cursor_name;
Stored procedures are created using the syntax: CREATE PROCEDURE procedure_name AS BEGIN ... END, and are executed with EXEC procedure_name.

Asked in bp

Q. What is the purpose of using Common Table Expressions (CTE) in SQL?
CTEs are used in SQL to create temporary result sets that can be referenced within a query.
CTEs improve readability and maintainability of complex queries
They can be recursively referenced to solve hierarchical problems
CTEs can be used to simplify queries by breaking them into smaller, more manageable parts
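As an illustration of the readability and recursion points, a small sketch using SQLite through Python's sqlite3 module (the employee hierarchy data is invented):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT, manager_id INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Asha", None), (2, "Bob", 1), (3, "Chen", 2), (4, "Dara", 1)],
)

# The CTE names an intermediate result (a recursive management chain) so the
# final SELECT stays short and readable
sql = """
WITH RECURSIVE chain(id, name, depth) AS (
    SELECT id, name, 0 FROM employees WHERE manager_id IS NULL
    UNION ALL
    SELECT e.id, e.name, c.depth + 1
    FROM employees e JOIN chain c ON e.manager_id = c.id
)
SELECT name, depth FROM chain ORDER BY depth;
"""
for row in conn.execute(sql):
    print(row)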

Asked in LTIMindtree

Q. What are SparkContext and SparkSession?
SparkContext is the main entry point for Spark functionality, while SparkSession is the entry point for Spark SQL.
SparkContext is the entry point for low-level API functionality in Spark.
SparkSession is the entry point for Spark SQL functionality.
SparkContext is used to create RDDs (Resilient Distributed Datasets) in Spark.
SparkSession provides a unified entry point for reading data from various sources and performing SQL queries.
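A minimal PySpark sketch showing how the two entry points relate (since Spark 2.x, the SparkSession wraps a SparkContext):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("entry-points-demo").getOrCreate()

# The SparkSession exposes the underlying SparkContext for low-level RDD work
sc = spark.sparkContext
rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.map(lambda x: x * x).collect())

# The SparkSession itself is the entry point for DataFrames and SQL
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.createOrReplaceTempView("t")
spark.sql("SELECT COUNT(*) AS n FROM t").show()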

Asked in TCS

Q. Can you introduce yourself and provide an overview of your experience in a data engineering role?
Data engineer with 5 years of experience in building scalable data pipelines and optimizing data workflows for analytics.
5 years of experience in data engineering, focusing on ETL processes and data warehousing.
Proficient in Python and SQL for data manipulation and analysis.
Experience with cloud platforms like AWS and Azure for deploying data solutions.
Implemented a real-time data pipeline using Apache Kafka, improving data availability by 30%.
Worked with data visualization tools to present insights to business stakeholders.

Asked in Vertoz

Q. SQL window functions, and the difference between RANK and DENSE_RANK
SQL window functions such as RANK and DENSE_RANK assign a rank to each row within a partition.
RANK gives tied rows the same rank and leaves gaps in the sequence after the ties (1, 1, 3, ...).
DENSE_RANK also gives tied rows the same rank, but leaves no gaps (1, 1, 2, ...).
Both functions are used with the OVER() clause in SQL to define the partition and order of rows.
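A small PySpark sketch of the gap behaviour (the scores are made up):

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("rank-demo").getOrCreate()
scores = spark.createDataFrame(
    [("a", 90), ("b", 90), ("c", 80), ("d", 70)], ["student", "score"]
)

w = Window.orderBy(F.col("score").desc())
scores.select(
    "student",
    "score",
    F.rank().over(w).alias("rank"),              # 1, 1, 3, 4  (gap after the tie)
    F.dense_rank().over(w).alias("dense_rank"),  # 1, 1, 2, 3  (no gap)
).show()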

Asked in HackFuel Services

Q. How does AWS Glue work, and how can ETL be performed with Glue?
AWS Glue is a fully managed ETL service that makes it easy to prepare and load data for analytics.
AWS Glue works by automatically discovering, cataloging, and transforming your data
ETL can be performed with Glue by creating ETL jobs using the Glue console or API
Glue supports various data sources such as Amazon S3, RDS, Redshift, and more
You can schedule and monitor ETL jobs using Glue's built-in scheduler and monitoring tools
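For the API-driven triggering and monitoring parts, a hedged boto3 sketch (the job name is a placeholder and the Glue job is assumed to already exist):

import time
import boto3

glue = boto3.client("glue")

# Kick off an existing Glue ETL job by name
run = glue.start_job_run(JobName="migrate_customers")
run_id = run["JobRunId"]

# Poll the run state instead of relying only on the console
while True:
    status = glue.get_job_run(JobName="migrate_customers", RunId=run_id)
    state = status["JobRun"]["JobRunState"]
    print("Glue job state:", state)
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)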

Asked in BridgeLabz Solutions

Q. What are the four pillars of OOPS? Explain with a real-world example.
The four pillars of OOP are encapsulation, inheritance, polymorphism, and abstraction, essential for building robust software.
Encapsulation: Bundling data and methods that operate on the data within a single unit (class). Example: A 'Car' class with properties like 'speed' and methods like 'accelerate()'.
Inheritance: Mechanism where a new class inherits properties and methods from an existing class. Example: A 'SportsCar' class inheriting from the 'Car' class.
Polymorphism: Ability of different classes to respond to the same method call in their own way. Example: 'Car' and 'SportsCar' each providing their own 'accelerate()' behaviour.
Abstraction: Exposing only essential behaviour while hiding implementation details. Example: a driver calls 'accelerate()' without knowing how the engine works.
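A compact Python sketch tying the four pillars to the car example above:

class Car:
    def __init__(self, speed=0):
        self._speed = speed             # encapsulation: state kept inside the object

    def accelerate(self, delta):        # abstraction: callers only see this method
        self._speed += delta
        return self._speed


class SportsCar(Car):                   # inheritance: reuses Car's data and methods
    def accelerate(self, delta):        # polymorphism: same call, different behaviour
        return super().accelerate(delta * 2)


for car in (Car(), SportsCar()):
    # The same accelerate(10) call behaves differently per concrete class
    print(type(car).__name__, car.accelerate(10))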

Asked in AROHA TECHNOLOGIES

Q. What is the process to write a factorial program in Python?
A factorial program in Python calculates the product of all positive integers up to a given number.
Define a function named 'factorial' that takes an integer 'n' as input.
Use a base case: if 'n' is 0 or 1, return 1.
Use recursion or a loop to multiply 'n' by the factorial of 'n-1'.
Example using recursion: 'return n * factorial(n - 1)'.
Example using a loop: 'for i in range(2, n + 1): result *= i'.
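Putting those steps together, a minimal Python version with both variants:

def factorial_recursive(n: int) -> int:
    # Base case stops the recursion; n * factorial(n - 1) does the rest
    if n <= 1:
        return 1
    return n * factorial_recursive(n - 1)


def factorial_iterative(n: int) -> int:
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result


print(factorial_recursive(5), factorial_iterative(5))   # 120 120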

Asked in Reflections Info Systems

Q. Can you describe a challenging data engineering project you worked on and how you handled it?
I tackled a complex data pipeline project that integrated disparate healthcare data sources for real-time analytics.
Identified data silos across multiple healthcare systems, including EHRs and lab systems.
Designed a robust ETL pipeline using Apache Airflow to automate data extraction and transformation.
Implemented data quality checks to ensure accuracy and consistency of incoming data.
Collaborated with cross-functional teams to understand data requirements and ensure alignment with business needs.

Asked in Tech Mahindra

Q. Write code to transform the input list [1,2,3,4] to the output list [1,4,9,16].
Code to square each element in the input array.
Iterate through the input array and square each element.
Store the squared values in a new list to get the desired output, as in the snippet below.
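A minimal Python version of that transformation:

numbers = [1, 2, 3, 4]

# List comprehension squares each element and collects the results in a new list
squares = [n * n for n in numbers]
print(squares)   # [1, 4, 9, 16]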

Asked in Tiger Analytics

Q. Given some tables, how would you extract values between specified ranges and filter by certain criteria?
Extract and filter values from tables based on specified criteria.
Identify the tables involved and their relationships.
Use SQL queries to extract data, e.g., SELECT * FROM table WHERE value BETWEEN x AND y.
Apply filtering conditions using WHERE clause, e.g., AND status = 'active'.
Consider using JOINs if data is spread across multiple tables.
Example: SELECT name FROM patients WHERE age BETWEEN 30 AND 40 AND diagnosis = 'diabetes'.

Asked in Nyalazone Solutions

Q. What are linked lists, hashmaps, dictionaries, tuples, and lists?
Linked lists, hashmaps, dictionaries, tuples, and lists are data structures used in programming.
Linked list is a linear data structure where each element points to the next element.
Hashmap is a data structure that maps keys to values for efficient lookup.
Dictionary is a collection of key-value pairs where each key is unique.
Tuples are immutable sequences whose elements may be of different data types.
List is an ordered, mutable collection of elements; in Python a list can also hold elements of mixed types.
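A short Python illustration; the Node class is hand-rolled because Python has no built-in linked list:

from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:                              # minimal singly linked list node
    value: int
    next: Optional["Node"] = None

head = Node(1, Node(2, Node(3)))         # 1 -> 2 -> 3

ages = {"asha": 31, "bob": 27}           # dict / hashmap: key-value lookup
point = (3.5, 7.2)                       # tuple: immutable sequence
names = ["asha", "bob", "chen"]          # list: ordered and mutable
names.append("dara")

node = head
while node:                               # walk the linked list
    print(node.value)
    node = node.next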

Asked in Amazon

Q. In SQL, given two tables with null values, what are the row counts resulting from different types of joins?
Understanding row counts from different SQL joins with null values in two tables.
INNER JOIN: Returns rows with matching values in both tables. Nulls in join columns are excluded.
LEFT JOIN: Returns all rows from the left table and the matched rows from the right table; unmatched left rows get NULLs in the right table's columns.
RIGHT JOIN: Returns all rows from the right table and the matched rows from the left table; unmatched right rows get NULLs in the left table's columns.
FULL OUTER JOIN: Returns all rows with a match in either table, plus unmatched rows from both sides with NULLs filling the missing columns. Rows whose join keys are NULL never match and only appear on their own side.
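A quick PySpark check of this behaviour with tiny made-up tables (one NULL key on each side):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-nulls-demo").getOrCreate()

left = spark.createDataFrame([(1, "a"), (2, "b"), (None, "c")], ["id", "l_val"])
right = spark.createDataFrame([(1, "x"), (3, "y"), (None, "z")], ["id", "r_val"])

# NULL keys never satisfy the equality condition, so they only survive in outer joins
for how in ("inner", "left", "right", "full"):
    print(how, left.join(right, on="id", how=how).count())
# typical result: inner 1, left 3, right 3, full 5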

Asked in Myntra

Q. How would you find the highest nth rank and lowest rank if the number of employees is less than 2?
If the number of employees is less than 2, the highest and lowest ranks will be the same.
If there are less than 2 employees, both highest and lowest rank will be the same.
For example, if there is only 1 employee, their rank will be 1 which is both the highest and lowest rank.