80+ Data Engineer Interview Questions and Answers for Freshers
Q1. If you want very low latency, which is better: standalone or client mode?
Client mode is better for very low latency because the driver communicates directly with the cluster.
Client mode runs the driver on the submitting machine, so results flow straight back to the user, reducing latency.
Standalone (cluster) mode adds a layer of communication between the user and the driver, increasing latency.
Client mode is preferred for real-time and interactive applications where low latency is crucial.
Q2. When a Spark job is submitted, what happens in the backend? Explain the flow.
When a Spark job is submitted, a series of steps runs in the backend to plan and execute the work.
The job is submitted to the Spark driver program.
The driver program communicates with the cluster manager to request resources.
The cluster manager allocates resources (CPU, memory) to the job.
The driver program builds a DAG (Directed Acyclic Graph) of the job's stages and tasks.
Tasks are then scheduled and executed on worker nodes in the cluster.
Intermediate results are kept in memory where possible (spilling to disk if needed), and the final results are returned to the driver.
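A minimal PySpark sketch of this flow (file and column names are hypothetical): the transformations only extend the DAG, and the action at the end triggers scheduling of tasks on the executors.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("flow-demo").getOrCreate()  # driver negotiates resources via the cluster manager
df = spark.read.csv("events.csv", header=True)                   # transformation: recorded in the DAG, not run yet
ok_rows = df.filter(df["status"] == "ok")                        # another lazy transformation
print(ok_rows.count())                                           # action: stages and tasks are scheduled on executors
spark.stop()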
Q3. How do you do performance optimization in Spark? Tell how you did it in your project.
Performance optimization in Spark involves tuning configurations, optimizing code, and utilizing caching.
Tune Spark configurations such as executor memory, number of executors, and shuffle partitions.
Optimize code by reducing unnecessary shuffles, using efficient transformations, and avoiding unnecessary data movements.
Utilize caching to store intermediate results in memory and avoid recomputation.
Example: In my project, I optimized Spark performance by increasing executor memory.
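As a hedged illustration of these levers (the specific values are placeholders, not recommendations), configuration tuning and caching in PySpark can look like this:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("perf-demo")
         .config("spark.executor.memory", "4g")          # tune executor memory
         .config("spark.sql.shuffle.partitions", "200")  # tune shuffle partitions
         .getOrCreate())

df = spark.read.parquet("sales.parquet")  # hypothetical input path
df.cache()                                # keep intermediate results in memory to avoid recomputation
df.count()                                # first action materializes the cache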
Q4. What is Data Lake? Difference between data lake and data warehouse
Data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed.
Data lake stores raw, unstructured data from various sources.
Data lake allows for storing large amounts of data without the need for a predefined schema.
Data lake is cost-effective for storing data that may not have a clear use case at the time of storage.
Data warehouse stores structured data for querying and analysis.
Data warehouse requires a predefined schema for data to be loaded and queried.
Q5. Why do we need a data warehouse? Why can't we store the data in a normal transactional database?
Data warehouses are designed for analytical queries and reporting, while transactional databases are optimized for transactional processing.
Data warehouses are optimized for read-heavy workloads, allowing for complex queries and reporting.
Transactional databases are optimized for write-heavy workloads, ensuring data integrity and consistency.
Data warehouses often store historical data for analysis, while transactional databases focus on current data for operational purposes.
Q6. What are the tools used for data engineering?
Tools used for data engineering include ETL tools, programming languages, databases, and cloud platforms.
ETL tools like Apache NiFi, Talend, and Informatica are used for data extraction, transformation, and loading.
Programming languages like Python, Java, and Scala are used for data processing and analysis.
Databases like MySQL, PostgreSQL, and MongoDB are used for storing and managing data.
Cloud platforms like AWS, Azure, and Google Cloud provide scalable infrastructure for data storage and processing.
Q7. How do you remove duplicate rows in BigQuery? How do you find the month of a given date in BigQuery?
To remove duplicate rows from BigQuery, use the DISTINCT keyword. To find the month of a given date, use the EXTRACT function.
To remove duplicate rows, use SELECT DISTINCT * FROM table_name;
To find the month of a given date, use SELECT EXTRACT(MONTH FROM date_column) AS month FROM table_name;
Make sure to replace 'table_name' and 'date_column' with the appropriate values in your query.
Q8. What are Python dataframes, and how and where do you use them in a project?
Python dataframes are used to organize and manipulate data in a tabular format.
Dataframes are created using the pandas library in Python.
They allow for easy manipulation of data, such as filtering, sorting, and grouping.
Dataframes can be used in various projects, such as data analysis, machine learning, and data visualization.
Examples of using dataframes include analyzing sales data, predicting customer behavior, and visualizing stock market trends.
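A small pandas sketch covering creation, filtering, and grouping (the data is made up for illustration):

import pandas as pd

df = pd.DataFrame({"region": ["N", "S", "N"], "sales": [100, 80, 120]})
north = df[df["region"] == "N"]               # filtering rows
totals = df.groupby("region")["sales"].sum()  # grouping and aggregation
print(north)
print(totals)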
Q9. Calculate second highest salary using SQL as well as pyspark.
Calculate second highest salary using SQL and pyspark
In SQL, order the distinct salaries in descending order and pick the second row (e.g., LIMIT with OFFSET, or a DENSE_RANK window)
In PySpark, use orderBy() on the distinct salaries and take() the second row
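A sketch of both approaches, assuming a hypothetical employees dataset with a salary column:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 100), ("b", 300), ("c", 200)], ["name", "salary"])
df.createOrReplaceTempView("employees")

# SQL: rank distinct salaries and keep the second one
spark.sql("""
    SELECT salary FROM (
        SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk FROM employees
    ) WHERE rnk = 2
""").show()

# PySpark: order the distinct salaries and take the second row
second = df.select("salary").distinct().orderBy("salary", ascending=False).take(2)[1][0]
print(second)  # 200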
Q10. How do you optimize SQL queries?
Optimizing SQL queries involves using indexes, avoiding unnecessary joins, and optimizing the query structure.
Use indexes on columns frequently used in WHERE clauses
Avoid using SELECT * and only retrieve necessary columns
Optimize joins by using INNER JOIN instead of OUTER JOIN when possible
Use EXPLAIN to analyze query performance and make necessary adjustments
Q11. What operator is used in Composer to move data from GCS to BQ?
The operator used in Composer to move data from GCS to BigQuery is the GCSToBigQueryOperator.
The GCS to BigQuery operator is used in Apache Airflow, which is the underlying technology of Composer.
This operator allows you to transfer data from Google Cloud Storage (GCS) to BigQuery.
You can specify the source and destination parameters in the operator to define the data transfer process.
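A hedged Airflow sketch (bucket, object, and table names are placeholders); the operator comes from the Google provider package used by Composer:

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG("gcs_to_bq_demo", start_date=datetime(2024, 1, 1), schedule_interval=None) as dag:
    load = GCSToBigQueryOperator(
        task_id="load_events",
        bucket="my-bucket",                                  # source GCS bucket (placeholder)
        source_objects=["events/*.csv"],                     # files to load
        destination_project_dataset_table="proj.ds.events",  # target BigQuery table (placeholder)
        write_disposition="WRITE_TRUNCATE",
    )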
Q12. What are the two types of modes for Spark architecture?
The two types of modes for Spark architecture are standalone mode and cluster mode.
Standalone mode: Spark runs on a single machine in a single JVM and is suitable for development and testing.
Cluster mode: Spark runs on a cluster of machines managed by a cluster manager like YARN or Mesos for production workloads.
Q13. How does AWS Glue work, and how can ETL be performed with Glue?
AWS Glue is a fully managed ETL service that makes it easy to prepare and load data for analytics.
AWS Glue works by automatically discovering, cataloging, and transforming your data
ETL can be performed with Glue by creating ETL jobs using the Glue console or API
Glue supports various data sources such as Amazon S3, RDS, Redshift, and more
You can schedule and monitor ETL jobs using Glue's built-in scheduler and monitoring tools
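A rough sketch of a Glue ETL job script (database, table, and bucket names are placeholders; the awsglue library is provided inside the Glue runtime):

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read a table that a Glue crawler has already cataloged
src = glue_context.create_dynamic_frame.from_catalog(database="raw_db", table_name="orders")

# Write the data out to S3 in Parquet format
glue_context.write_dynamic_frame.from_options(
    frame=src,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/curated/orders/"},
    format="parquet",
)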
Q14. What is SparkContext and SparkSession?
SparkContext is the main entry point for Spark functionality, while SparkSession is the entry point for Spark SQL.
SparkContext is the entry point for low-level API functionality in Spark.
SparkSession is the entry point for Spark SQL functionality.
SparkContext is used to create RDDs (Resilient Distributed Datasets) in Spark.
SparkSession provides a unified entry point for reading data from various sources and performing SQL queries.
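For example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").getOrCreate()  # unified entry point (Spark SQL)
sc = spark.sparkContext                                     # the underlying SparkContext

rdd = sc.parallelize([1, 2, 3])                        # low-level RDD API via SparkContext
df = spark.createDataFrame([(1, "a")], ["id", "val"])  # structured API via SparkSession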
Q15. Do you have experience with AWS Glue? How would you use Glue for data migration?
Yes, I have experience in AWS Glue and can use it for data migration.
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data for analytics.
To use Glue for data migration, I would start by creating a Glue job that defines the source and target data sources, as well as any transformations needed.
I would then configure the job to run on a schedule or trigger it manually to migrate the data from the source to the target.
Q16. SQL window functions: what is the difference between RANK and DENSE_RANK?
SQL Window Functions like RANK and DENSE RANK are used to assign a rank to rows within a partition.
RANK assigns the same rank to tied rows and leaves gaps in the sequence after them (e.g., 1, 1, 3).
DENSE_RANK also assigns the same rank to tied rows but leaves no gaps (e.g., 1, 1, 2).
Both functions are used with the OVER() clause in SQL to define the partition and order of rows.
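A PySpark sketch showing the difference on tied scores:

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import dense_rank, rank

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 100), ("b", 100), ("c", 90)], ["name", "score"])
w = Window.orderBy(df["score"].desc())

# RANK leaves a gap after the tie (1, 1, 3); DENSE_RANK does not (1, 1, 2)
df.select("name", rank().over(w).alias("rank"), dense_rank().over(w).alias("dense_rank")).show()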
Q17. What are linked lists, hashmaps, dictionaries, tuples, and lists?
Linked list, hashmap, dictionary, tuples and list are data structures used in programming.
Linked list is a linear data structure where each element points to the next element.
Hashmap is a data structure that maps keys to values for efficient lookup.
Dictionary is a collection of key-value pairs where each key is unique.
Tuples are immutable sequences whose elements can be of different data types.
Lists are mutable ordered sequences; in Python they can hold elements of any type.
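A quick Python illustration (the linked list node is sketched by hand, since Python has no built-in linked list):

nums = [1, "two", 3.0]           # list: mutable, ordered, mixed types allowed
point = (3, 4)                   # tuple: immutable sequence
ages = {"ann": 30, "bob": 25}    # dict: unique keys mapped to values (hash-based lookup)

class Node:                      # linked list node: a value plus a pointer to the next node
    def __init__(self, value, next=None):
        self.value, self.next = value, next

head = Node(1, Node(2, Node(3)))  # 1 -> 2 -> 3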
Q18. What are the highest nth rank and the lowest rank if the number of employees is less than 2?
If the number of employees is less than 2, the highest and lowest rank will be the same.
For example, if there is only 1 employee, their rank will be 1 which is both the highest and lowest rank.
Q19. Write code for this: input = [1,2,3,4], output = [1,4,9,16]
Code to square each element in the input array.
Iterate through the input array and square each element.
Store the squared values in a new array to get the desired output.
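A minimal Python solution:

def squares(nums):
    return [x ** 2 for x in nums]  # square each element, preserving order

print(squares([1, 2, 3, 4]))       # [1, 4, 9, 16]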
Q20. Remove zeroes from a digits-only string and move them to the end, maintaining relative order.
Move all zeroes in the string to the end while keeping the other digits in their original order.
Iterate through the string and separate numbers and zeroes
Store numbers in an array and zeroes in a separate array
Concatenate the numbers array with the zeroes array at the end
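A short Python sketch of this approach:

def move_zeroes(s):
    digits = [c for c in s if c != "0"]  # non-zero digits, relative order preserved
    zeroes = [c for c in s if c == "0"]  # the zeroes to append
    return "".join(digits + zeroes)

print(move_zeroes("102030"))  # "123000"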
Q21. How do you find the most frequent word in a sentence?
The most frequent word in a sentence can be found by counting the occurrence of each word and selecting the one with the highest count.
Split the sentence into words using whitespace as delimiter
Create a dictionary to store the count of each word
Iterate through the words and update the count in the dictionary
Find the word with the highest count in the dictionary
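A compact Python version using collections.Counter:

from collections import Counter

def most_frequent_word(sentence):
    words = sentence.split()                    # split on whitespace
    return Counter(words).most_common(1)[0][0]  # word with the highest count

print(most_frequent_word("the cat and the dog and the bird"))  # "the"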
Q22. How does UNION ALL work in a given scenario?
UNION ALL combines the results of two or more SELECT statements into a single result set.
It includes all rows from each SELECT statement, even if there are duplicates.
The number of columns and data types must be the same in all SELECT statements.
Example: SELECT column1 FROM table1 UNION ALL SELECT column1 FROM table2;
Example: SELECT name, age FROM employees UNION ALL SELECT name, age FROM contractors;
Q23. What is UNION ALL in SQL and how does it work?
UNION ALL in SQL combines the result sets of two or more SELECT statements into a single result set, including duplicate rows.
It includes all rows from each SELECT statement, even if there are duplicate rows.
The number of columns and data types must be the same in all SELECT statements being combined.
Example: SELECT column1 FROM table1 UNION ALL SELECT column1 FROM table2;
Q24. How do you manage pressure?
I manage pressure by prioritizing tasks, taking breaks, and seeking support when needed.
Prioritize tasks based on urgency and importance
Take breaks to recharge and avoid burnout
Seek support from colleagues or a mentor
Practice stress-reducing activities like meditation or exercise
Q25. What is lazy evaluation in Spark?
Lazy evaluation in Spark delays the execution of transformations until an action is called.
Lazy evaluation allows Spark to optimize the execution plan by combining multiple transformations into a single stage.
Transformations are not executed immediately, but are stored as a directed acyclic graph (DAG) of operations.
Actions trigger the execution of the DAG and produce results.
Example: map() and filter() are transformations that are lazily evaluated until an action like collect() or count() is called.
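A minimal PySpark sketch: nothing runs until collect() is called.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize(range(10))
doubled = rdd.map(lambda x: x * 2)     # transformation: added to the DAG, not executed
big = doubled.filter(lambda x: x > 5)  # transformation: still nothing executed
print(big.collect())                   # action: the whole DAG executes now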
Q26. What are skewness and skewed tables?
Skewness is a measure of asymmetry in a distribution. Skewed tables are tables with imbalanced data distribution.
Skewness is a statistical measure that describes the asymmetry of the data distribution around the mean.
Positive skewness indicates a longer tail on the right side of the distribution, while negative skewness indicates a longer tail on the left side.
Skewed tables in data engineering refer to tables with an imbalanced data distribution, which can impact query performance by overloading a few tasks or partitions.
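One common mitigation for a skewed join key is salting; a hedged PySpark sketch (the tiny tables stand in for a skewed fact table and a small dimension table):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
fact = spark.createDataFrame([("hot", i) for i in range(1000)] + [("cold", 1)], ["key", "v"])
dim = spark.createDataFrame([("hot", "H"), ("cold", "C")], ["key", "label"])

N = 8  # number of salt buckets to spread the hot key over
fact_salted = fact.withColumn("salt", (F.rand() * N).cast("int"))
dim_salted = dim.crossJoin(spark.range(N).withColumnRenamed("id", "salt"))
joined = fact_salted.join(dim_salted, ["key", "salt"])  # work for "hot" now splits across buckets
print(joined.count())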
Q27. What is Spark? Explain how it works.
Spark is a distributed computing framework designed for big data processing.
Spark is built around the concept of Resilient Distributed Datasets (RDDs) which allow for fault-tolerant parallel processing of data.
It provides high-level APIs in Java, Scala, Python, and R for ease of use.
Spark can run on top of Hadoop, Mesos, Kubernetes, or in standalone mode.
It includes modules for SQL, streaming, machine learning, and graph processing.
Spark uses in-memory processing to speed up computation compared to disk-based engines such as Hadoop MapReduce.
Q28. Explain the architecture of BigQuery and query optimization techniques in BigQuery.
BigQuery architecture includes storage, execution, and optimization components for efficient query processing.
BigQuery stores data in the Capacitor columnar storage format for fast access.
Query execution is distributed across multiple nodes for parallel processing.
Query optimization techniques include partitioning tables, clustering tables, and using query cache.
Using partitioned tables can help eliminate scanning unnecessary data.
Clustering tables based on certain columns can improve query performance by colocating related rows.
Q29. What are overloading and overriding in Java?
Overloading is having multiple methods with the same name but different parameters. Overriding is implementing a method in a subclass that is already defined in the parent class.
Overloading allows a class to have multiple methods with the same name but different parameters.
Overriding occurs when a subclass provides a specific implementation of a method that is already provided by its parent class.
Overloading is determined at compile time based on the method signature, while overriding is resolved at runtime based on the object's actual type.
Q30. How to decide upon Spark cluster sizing?
Spark cluster sizing depends on workload, data size, memory requirements, and processing speed.
Consider the size of the data being processed
Take into account the memory requirements of the Spark jobs
Factor in the processing speed needed for the workload
Scale the cluster based on the number of nodes and cores required
Monitor performance and adjust cluster size as needed
Q31. What is a Data Warehouse?
A Data Warehouse is a centralized repository that stores integrated data from multiple sources for analysis and reporting.
Data Warehouses are designed for query and analysis rather than transaction processing.
They often contain historical data and are used for decision-making purposes.
Data Warehouses typically use a dimensional model with facts and dimensions.
Examples of Data Warehouse tools include Amazon Redshift, Snowflake, and Google BigQuery.
Q32. Create a dataframe, add a new column, and find consecutive user IDs by date.
Create dataframe, add new column, find consecutive user IDs by date
Create a dataframe with user IDs and dates
Add a new column to calculate the difference in dates for each user ID
Identify consecutive user IDs based on the date difference
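A hedged PySpark sketch of all three steps (the user IDs and dates are made up):

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, datediff, lag

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "2024-01-01"), (1, "2024-01-02"), (2, "2024-01-01"), (2, "2024-01-05")],
    ["user_id", "dt"],
).withColumn("dt", col("dt").cast("date"))

w = Window.partitionBy("user_id").orderBy("dt")
df = df.withColumn("gap", datediff(col("dt"), lag("dt").over(w)))  # new column: days since previous row
df.filter(col("gap") == 1).select("user_id").distinct().show()     # users active on consecutive dates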
Q33. Which is better: ETL or ELT?
ETL is better for batch processing, ELT is better for real-time processing.
ETL is better for large volumes of data that need to be transformed before loading into a data warehouse.
ELT is better for real-time processing where data can be loaded into a data warehouse first and then transformed as needed.
ETL requires more storage space as data is transformed before loading, while ELT saves storage space by loading data first and transforming later.
Q34. What are the languages based on SQL?
Languages based in SQL include T-SQL, PL/SQL, and SQL/PSM.
T-SQL (Transact-SQL) is Microsoft's proprietary extension to SQL used in SQL Server.
PL/SQL (Procedural Language/SQL) is Oracle's procedural language extension to SQL.
SQL/PSM (SQL/Persistent Stored Modules) is a standard for defining stored procedures and functions in SQL.
Q35. What do you understand by OOPs concepts?
Object-oriented programming concepts that focus on classes and objects
Encapsulation: bundling data and methods that operate on the data within a single unit
Inheritance: ability of a class to inherit properties and behavior from another class
Polymorphism: ability to present the same interface for different data types
Abstraction: hiding the complex implementation details and showing only the necessary features
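A small Python example touching all four concepts:

class Animal:
    def __init__(self, name):
        self._name = name  # encapsulation: state kept inside the object

    def speak(self):       # abstraction: callers only see this interface
        raise NotImplementedError

class Dog(Animal):         # inheritance: Dog reuses Animal
    def speak(self):       # overriding this method enables polymorphism
        return f"{self._name} says woof"

class Cat(Animal):
    def speak(self):
        return f"{self._name} says meow"

for pet in (Dog("Rex"), Cat("Mia")):  # polymorphism: same call, different behavior
    print(pet.speak())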
Q36. What is a KPI? How can we monitor a server?
KPI stands for Key Performance Indicator. Servers can be monitored using tools like Nagios, Zabbix, or Prometheus.
KPI is a measurable value that demonstrates how effectively a company is achieving key business objectives.
To monitor a server, tools like Nagios, Zabbix, or Prometheus can be used to track metrics such as CPU usage, memory usage, disk space, and network traffic.
Setting up alerts based on predefined thresholds can help in proactively managing server performance and availability.
Q37. How do you migrate data from a local server to AWS Redshift?
To migrate data from a local server to AWS Redshift, you can use various methods such as AWS Database Migration Service, AWS Glue, or manual ETL processes.
Use AWS Database Migration Service (DMS) to replicate data from the local server to Redshift
Create a DMS replication instance and endpoints for the source and target databases
Configure the replication task to specify the source and target endpoints, table mappings, and transformation rules
Start the replication task to migrate the data from the local server to Redshift.
Q38. Explain the blockers you encountered while working.
Blockers faced while working as a Data Engineer
Lack of proper documentation
Inadequate infrastructure
Data quality issues
Limited access to necessary data sources
Inefficient data processing pipelines
Q39. What is the difference between select and withColumn in PySpark?
select is used to select specific columns from a DataFrame, while withColumn is used to add or update columns in a DataFrame.
select is used to select specific columns from a DataFrame
withColumn is used to add or update columns in a DataFrame
Both return new DataFrames without modifying the original: select projects only the chosen columns, while withColumn returns all columns plus the added/updated one.
Example: df.select('col1', 'col2') - selects columns col1 and col2 from DataFrame df
Example: df.withColumn('new_col', df['col1'] * 2) - adds a new column new_col derived from col1
Q40. How do you check if there is a loop in a linked list?
Check for a loop in a linked list by using two pointers moving at different speeds.
Use two pointers, one moving at a normal speed and another moving at double the speed.
If there is a loop, the two pointers will eventually meet at some point.
Alternatively, use a hash set to store visited nodes and check for duplicates.
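Floyd's two-pointer approach in Python:

class Node:
    def __init__(self, value):
        self.value, self.next = value, None

def has_loop(head):
    slow = fast = head
    while fast and fast.next:
        slow = slow.next       # moves one step
        fast = fast.next.next  # moves two steps
        if slow is fast:       # the pointers meet only if there is a cycle
            return True
    return False

a, b, c = Node(1), Node(2), Node(3)
a.next, b.next, c.next = b, c, a  # 1 -> 2 -> 3 -> back to 1
print(has_loop(a))                # True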
Q41. What is the difference between SQL joins? Write a few SQL queries.
SQL joins are used to combine rows from different tables based on a related column, while SQL queries are used to retrieve specific data from a database.
SQL joins are used to combine rows from different tables based on a related column
There are different types of SQL joins, such as inner join, left join, right join, and full outer join
SQL queries are used to retrieve specific data from a database based on certain conditions
Common SQL queries include SELECT, INSERT, UPDATE, and DELETE.
Q42. What are nested queries in BigQuery?
Nested queries in BigQuery allow for querying data from within another query, enabling complex data analysis.
Nested queries are queries that are embedded within another query
They can be used to perform subqueries to filter, aggregate, or manipulate data
Nested queries can be used in SELECT, FROM, WHERE, and HAVING clauses
Q43. Data pipeline design and best practices.
Data pipeline design involves creating a system to efficiently collect, process, and analyze data.
Understand the data sources and requirements before designing the pipeline.
Use tools like Apache Kafka, Apache NiFi, or AWS Glue for data ingestion and processing.
Implement data validation and error handling mechanisms to ensure data quality.
Consider scalability and performance optimization while designing the pipeline.
Document the pipeline architecture and processes for future reference and maintenance.
Q44. What is the difference between Delta Lake and a data lake?
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads.
Delta Lake provides ACID transactions, schema enforcement, and time travel capabilities on top of data lakes.
Data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed.
Delta Lake ensures data reliability and data quality by providing ACID transactions.
Data lake lacks ACID transactions and schema enforcement, making it more prone to data quality and reliability issues.
Q45. Write a command to fetch data from a table.
Use SQL SELECT command to fetch data from a table
Use the SELECT statement followed by the column names you want to retrieve
Specify the table name after the keyword FROM
Add conditions using WHERE clause if needed
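Example (hypothetical table and column names): SELECT name, salary FROM employees WHERE salary > 50000;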
Q46. What is the difference between Bigtable and BigQuery?
Bigtable is a NoSQL database for real-time analytics, while BigQuery is a fully managed data warehouse for running SQL queries.
Bigtable is a NoSQL database designed for real-time analytics and high throughput, while BigQuery is a fully managed data warehouse for running SQL queries.
Bigtable is used for storing large amounts of semi-structured data, while BigQuery is used for analyzing structured data using SQL queries.
Bigtable is suitable for real-time data processing and high-throughput workloads, while BigQuery is suited for large-scale analytical queries.
Q47. Difference between variables and parameters in ADF
Variables are used to store values that can be changed, while parameters are used to pass values into activities in ADF.
Variables can be modified within a pipeline, while parameters are set at runtime and cannot be changed within the pipeline.
Both are declared at the pipeline level, but variables are set by activities inside the pipeline, while parameters are supplied from outside when the pipeline is invoked.
Variables can be used to store intermediate values or results, while parameters are used to pass values between activities.
Example: A variable can store an intermediate result computed during the run, while a parameter can pass a value such as a file path into the pipeline.
Q48. What Azure tech stack is used in your current project?
Azure tech stack used in the current project includes Azure Data Factory, Azure Databricks, and Azure SQL Database.
Azure Data Factory for data integration and orchestration
Azure Databricks for big data processing and analytics
Azure SQL Database for storing and querying structured data
Q49. What is the difference between RDD, DataFrame, and Dataset in PySpark?
RDD (Resilient Distributed Dataset) is the basic abstraction in PySpark, representing a distributed collection of objects
Dataframe is a distributed collection of data organized into named columns, similar to a table in a relational database
Dataset is a distributed collection with compile-time type safety that supports custom JVM classes and user-defined functions; Datasets are available in Scala and Java but not in PySpark.
DataFrames and Datasets are built on top of RDDs, providing a more structured API that Spark's Catalyst optimizer can optimize.
Q50. How do you call a button in JS?
You can call a button in JavaScript by using the document.getElementById() method.
Use document.getElementById() method to select the button by its id attribute
Assign the selected button to a variable for further manipulation
Call button methods or add event listeners to the button variable