900+ Data Engineer Interview Questions and Answers
You and your friend Ninjax are playing a game of coins. Ninjax places 'N' coins in a straight line.
The rule of the game is as follows:
1. Each coin has a value associate...read more
The task is to find the maximum amount you can definitely win in a game of coins against an opponent who plays optimally.
The game is played with alternating turns, and each player can pick the first or last coin from the line.
The value associated with the picked coin adds up to the total amount the player wins.
To maximize your winnings, you need to consider all possible combinations of coin picks.
Use dynamic programming to calculate the maximum amount you can win.
Keep track o...read more
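A minimal Python sketch of the dynamic-programming idea described above (function and variable names are illustrative, not taken from the original problem statement):

```python
def max_winnings(coins):
    """Maximum amount the first player can guarantee when both play optimally."""
    n = len(coins)
    prefix = [0]
    for c in coins:
        prefix.append(prefix[-1] + c)
    # dp[i][j] = best total the player to move can collect from coins[i..j]
    dp = [[0] * n for _ in range(n)]
    for i in range(n - 1, -1, -1):
        for j in range(i, n):
            total = prefix[j + 1] - prefix[i]
            if i == j:
                dp[i][j] = coins[i]
            else:
                # after our pick, the opponent plays optimally on the remaining range
                dp[i][j] = total - min(dp[i + 1][j], dp[i][j - 1])
    return dp[0][n - 1]

print(max_winnings([5, 3, 7, 10]))  # 15 (pick 10, later pick 5)
```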
You are given an array arr of length N. You have to return a list of integers containing the NGE(next greater element) of each element of the given array. The NGE for an element X is the fir...read more
The task is to find the next greater element for each element in the given array.
Iterate through the array from right to left.
Use a stack to keep track of the next greater element.
Pop elements from the stack until a greater element is found or the stack is empty.
If the stack is empty, there is no greater element, so assign -1.
If a greater element is found, assign it as the next greater element.
Push the current element onto the stack.
Return the list of next greater elements.
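A short sketch of the stack-based approach in Python (names are illustrative):

```python
def next_greater_elements(arr):
    """Next greater element to the right of each element; -1 if none exists."""
    result = [-1] * len(arr)
    stack = []  # candidates for "next greater", maintained while scanning right to left
    for i in range(len(arr) - 1, -1, -1):
        # discard elements that are not greater than the current one
        while stack and stack[-1] <= arr[i]:
            stack.pop()
        if stack:
            result[i] = stack[-1]
        stack.append(arr[i])
    return result

print(next_greater_elements([1, 3, 2, 4]))  # [3, 4, 4, -1]
```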
Data Engineer Interview Questions and Answers for Freshers
Aahad and Harshit always have fun by solving problems. Harshit took a sorted array and rotated it clockwise by an unknown amount. For example, he took a sorted array = [1, 2, 3, 4,...read more
This is a problem where a sorted array is rotated and we need to search for given numbers in the array.
The array is rotated clockwise by an unknown amount.
We need to search for Q numbers in the rotated array.
If a number is found, we need to return its index, otherwise -1.
The search needs to be done in O(logN) time complexity.
The input consists of the size of the array, the array itself, the number of queries, and the queries.
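A hedged sketch of the O(logN) search, assuming the array elements are distinct:

```python
def search_rotated(arr, target):
    """Binary search in a sorted array rotated by an unknown amount."""
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid
        if arr[lo] <= arr[mid]:              # left half is sorted
            if arr[lo] <= target < arr[mid]:
                hi = mid - 1
            else:
                lo = mid + 1
        else:                                 # right half is sorted
            if arr[mid] < target <= arr[hi]:
                lo = mid + 1
            else:
                hi = mid - 1
    return -1

rotated = [4, 5, 1, 2, 3]
print([search_rotated(rotated, q) for q in (1, 5, 6)])  # [2, 1, -1]
```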
We are suffering from the Second wave of Covid-19. The Government is trying to increase its vaccination drives. Ninja wants to help the Government to plan an effective method to help increase v...read more
This question asks for finding the maximum number of vaccines administered on a specific day during a vaccination drive, given the total number of days, total number of vaccines available, and the day number.
Read the number of test cases
For each test case, read the number of days, day number, and total number of vaccines available
Implement a logic to find the maximum number of vaccines administered on the given day number
Print the maximum number of vaccines administered for e...read more
You are given two sorted arrays/list ‘arr1’ and ‘arr2’ and an integer k. You create a new sorted array by merging all the elements from ‘arr1’ and ‘arr2’. Your task is to find the ...read more
The task is to find the kth smallest element of a merged array created by merging two sorted arrays.
Merge the two sorted arrays into a single sorted array
Return the kth element of the merged array
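A minimal sketch, assuming k is 1-indexed:

```python
import heapq

def kth_smallest(arr1, arr2, k):
    """kth smallest element after merging two already-sorted arrays."""
    merged = list(heapq.merge(arr1, arr2))  # linear merge of the sorted inputs
    return merged[k - 1]

print(kth_smallest([2, 4, 6], [1, 3, 5], 4))  # 4
```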
Q6. 1) If you are given cards numbered 1-1000 and there are 4 boxes. Card no. 1 will go in box 1, card 2 in box 2, and so on. Card 5 will again go in box 1. So what will be the logic for this cod...
Logic for distributing cards among 4 boxes in a circular manner.
Use modulo operator to distribute cards among boxes in a circular manner.
Use the modulo operator to map each card to a box in a circular manner: box = ((card_number - 1) % 4) + 1.
Cards 1-4 go to boxes 1-4, card 5 wraps back to box 1, card 6 to box 2, and so on.
A plain divisibility check (e.g. "divisible by 4 goes to box 4") does not work: card 6 is divisible by 2 and 3, yet it belongs in box 2.
A sketch follows below.
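A one-line sketch of the circular assignment:

```python
def box_for_card(card_number, num_boxes=4):
    """Circular assignment: cards 1..4 -> boxes 1..4, card 5 -> box 1, and so on."""
    return (card_number - 1) % num_boxes + 1

print([box_for_card(c) for c in range(1, 9)])  # [1, 2, 3, 4, 1, 2, 3, 4]
```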
Given a binary tree, return the zigzag level order traversal of the nodes' values of the given tree. Zigzag traversal means starting from left to right, then right to left for the ne...read more
The zigzag level order traversal of a binary tree is the traversal of its nodes' values in an alternate left to right and right to left manner.
Perform a level order traversal of the binary tree
Use a queue to store the nodes at each level
For each level, alternate the direction of traversal
Store the values of the nodes in each level in separate arrays
Combine the arrays in alternate order to get the zigzag level order traversal
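A sketch of the traversal in Python (the Node class here is a stand-in, not part of the original question):

```python
from collections import deque

class Node:
    def __init__(self, val, left=None, right=None):
        self.val, self.left, self.right = val, left, right

def zigzag_level_order(root):
    """Level order traversal, reversing direction on alternate levels."""
    if not root:
        return []
    result, queue, left_to_right = [], deque([root]), True
    while queue:
        level = [queue.popleft() for _ in range(len(queue))]
        values = [node.val for node in level]
        result.append(values if left_to_right else values[::-1])
        left_to_right = not left_to_right
        for node in level:
            if node.left:
                queue.append(node.left)
            if node.right:
                queue.append(node.right)
    return result

tree = Node(1, Node(2, Node(4), Node(5)), Node(3, Node(6), Node(7)))
print(zigzag_level_order(tree))  # [[1], [3, 2], [4, 5, 6, 7]]
```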
Q8. Write code for printing duplicate numbers in a list.
Code to print duplicate numbers in a list.
Iterate through the list and keep track of the count of each number using a dictionary.
Print the numbers that have a count greater than 1.
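A minimal sketch:

```python
from collections import Counter

def print_duplicates(numbers):
    """Print every number that appears more than once in the list."""
    counts = Counter(numbers)
    for value, count in counts.items():
        if count > 1:
            print(value)

print_duplicates([1, 2, 3, 2, 4, 1, 5])  # prints 1 and 2
```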
You are given a string S which represents a number. You have to find the smallest number strictly greater than the given number which contains the same set of digits as of the original number...read more
The task is to find the smallest number greater than the given number, with the same set of digits.
Iterate through the digits of the given number from right to left.
Find the first digit that is smaller than the digit to its right.
Swap this digit with the smallest digit to its right that is greater than it.
Sort the digits to the right of the swapped digit in ascending order.
If no such digit is found, return -1.
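A sketch of the steps above (essentially the next-permutation algorithm), taking the number as a string:

```python
def next_greater_number(s):
    """Smallest number strictly greater than s using the same digits; "-1" if none."""
    digits = list(s)
    # find the rightmost digit smaller than the digit to its right
    i = len(digits) - 2
    while i >= 0 and digits[i] >= digits[i + 1]:
        i -= 1
    if i < 0:
        return "-1"                      # digits are in descending order
    # swap it with the smallest digit to its right that is still greater
    j = len(digits) - 1
    while digits[j] <= digits[i]:
        j -= 1
    digits[i], digits[j] = digits[j], digits[i]
    # reverse the suffix so it becomes as small as possible
    digits[i + 1:] = reversed(digits[i + 1:])
    return "".join(digits)

print(next_greater_number("218765"))  # 251678
print(next_greater_number("4321"))    # -1
```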
Q10. How do you handle changing schema from the source? What are the common issues faced in Hadoop, and how did you resolve them?
Handling changing schema from source in Hadoop
Use schema evolution techniques like Avro or Parquet to handle schema changes
Implement a flexible ETL pipeline that can handle schema changes
Use tools like Apache NiFi to dynamically adjust schema during ingestion
Common issues include data loss, data corruption, and performance degradation
Resolve issues by implementing proper testing, monitoring, and backup strategies
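One concrete option, if the data lands as Parquet and is read with PySpark, is schema merging on read; the path below is a placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-evolution").getOrCreate()

# mergeSchema reconciles Parquet files written with different (compatible) schemas,
# so newly added source columns appear as nullable columns instead of breaking reads
df = spark.read.option("mergeSchema", "true").parquet("/data/events/")
df.printSchema()
```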
Q11. 1. Design and code a scheduler for allocating meeting rooms for the given input of room counts and timestamps: Input: No of rooms: 2, Time and duration: 12pm 30 min. Output: yes. Every time the code runs it shou...
Design and code a scheduler for allocating meeting rooms based on input of room counts and timestamps.
Create a table with columns for room number, start time, and end time
Use SQL queries to check for available slots and allocate rooms
Consider edge cases such as overlapping meetings and room availability
Use a loop to continuously check for available slots and allocate rooms
Implement error handling for invalid input
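A simplified in-memory sketch of the allocation check (a SQL-backed version would apply the same overlap condition in a query); the two-room setup and names are illustrative:

```python
from datetime import datetime, timedelta

def can_allocate(bookings, num_rooms, start, duration_minutes):
    """Return True and record the booking if any of the rooms is free for the slot."""
    end = start + timedelta(minutes=duration_minutes)
    for room in range(num_rooms):
        # two intervals overlap when each starts before the other ends
        overlaps = any(s < end and start < e for s, e in bookings[room])
        if not overlaps:
            bookings[room].append((start, end))
            return True
    return False

bookings = {room: [] for room in range(2)}   # 2 rooms, no bookings yet
slot = datetime(2024, 1, 1, 12, 0)           # 12pm
print(can_allocate(bookings, 2, slot, 30))   # True  -> "yes"
print(can_allocate(bookings, 2, slot, 30))   # True  -> second room
print(can_allocate(bookings, 2, slot, 30))   # False -> both rooms busy
```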
Asked several database-related questions, specifically SQL-related, such as how joins are used.
Q13. Python - sum of digits till the result becomes a single digit. Ex: 479 → 20 → 2
Python program to find the sum of digits till the result becomes a single digit.
Convert the number to a string and iterate through each digit.
Add the digits and store the result.
Repeat the process until the result becomes a single digit.
Return the single digit result.
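A minimal sketch:

```python
def digit_root(n):
    """Repeatedly sum the digits until a single digit remains, e.g. 479 -> 20 -> 2."""
    while n >= 10:
        n = sum(int(d) for d in str(n))
    return n

print(digit_root(479))  # 2
```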
Q14. Assume we have a PAN-India retail store, because of which I have two customer tables in the backend: one is a customer profile table and the other is a customer transaction table, both linked by customer ID. So what will the ...
Use a left join as a computationally efficient way to find customer names from the customer profile and transaction tables.
Use left join to combine customer profile and transaction tables based on customer id
Left join will include all customers from profile table even if they don't have transactions
Subquery may be less efficient as it has to be executed for each row in the result set
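A PySpark sketch with hypothetical stand-in tables and column names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("left-join-demo").getOrCreate()

# hypothetical stand-ins for the profile and transaction tables
profiles = spark.createDataFrame(
    [(1, "Asha"), (2, "Ravi"), (3, "Meera")], ["customer_id", "customer_name"])
txns = spark.createDataFrame(
    [(1, 500.0), (1, 120.0), (3, 80.0)], ["customer_id", "amount"])

# left join keeps every customer, even those with no transactions (amount = null)
result = profiles.join(txns, on="customer_id", how="left")
result.show()
```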
Q15. SQL question to get name of employee whose salary is greater than average salary of the department
SQL query to retrieve name of employee with salary greater than department average.
Calculate average salary of department using GROUP BY clause
Join employee and department tables using department ID
Filter employees with salary greater than department average
Select employee name
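A hedged example using Spark SQL with made-up table and column names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("salary-demo").getOrCreate()

employees = spark.createDataFrame(
    [("Asha", "Sales", 70000), ("Ravi", "Sales", 50000),
     ("Meera", "HR", 60000), ("Kiran", "HR", 40000)],
    ["emp_name", "dept_id", "salary"])
employees.createOrReplaceTempView("employees")

# compare each employee's salary against their department's average
spark.sql("""
    SELECT e.emp_name
    FROM employees e
    JOIN (SELECT dept_id, AVG(salary) AS avg_salary
          FROM employees GROUP BY dept_id) d
      ON e.dept_id = d.dept_id
    WHERE e.salary > d.avg_salary
""").show()   # Asha and Meera
```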
Q16. Sql - List employees whose age is greater than average age of all employees
List employees whose age is greater than average age of all employees using SQL.
Calculate the average age of all employees using AVG() function.
Use WHERE clause to filter out employees whose age is greater than the average age.
If age is a column of the employee table, no join is needed; otherwise join the employee table with the table that holds age.
Example: SELECT * FROM employees WHERE age > (SELECT AVG(age) FROM employees);
Q17. 1) How to handle data skewness in spark.
Data skewness in Spark can be handled by partitioning, bucketing, or using salting techniques.
Partitioning the data based on a key column can distribute the data evenly across the nodes.
Bucketing can group the data into buckets based on a key column, which can improve join performance.
Salting involves adding a random prefix to the key column, which can distribute the data evenly.
Using broadcast joins for small tables can also help in reducing skewness.
Using dynamic allocation...read more
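A hedged PySpark sketch of the salting idea, with a made-up hot key and tables:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("salting-demo").getOrCreate()
NUM_SALTS = 8  # tune to the observed skew

# hypothetical skewed fact table and small dimension table
fact = spark.range(0, 1_000_000).withColumn("key", F.lit("hot_key"))
dim = spark.createDataFrame([("hot_key", "some_value")], ["key", "value"])

# add a random salt to the skewed side, and replicate the other side over all salts
fact_salted = fact.withColumn("salt", (F.rand() * NUM_SALTS).cast("long"))
dim_salted = dim.crossJoin(spark.range(NUM_SALTS).withColumnRenamed("id", "salt"))

joined = fact_salted.join(dim_salted, on=["key", "salt"]).drop("salt")
print(joined.count())
```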
Q18. Write code to print reverse of a sentence word by word.
Code to print reverse of a sentence word by word.
Split the sentence into words using space as delimiter
Store the words in an array
Print the words in reverse order
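A minimal sketch:

```python
def reverse_words(sentence):
    """Print the words of a sentence in reverse order."""
    words = sentence.split()            # split on whitespace
    print(" ".join(reversed(words)))

reverse_words("data engineers love spark")  # spark love engineers data
```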
Q19. What optimisations are possible to reduce the overhead of reading large datasets in Spark?
Optimizations like partitioning, caching, and using efficient file formats can reduce overhead in reading large datasets in Spark.
Partitioning data based on key can reduce the amount of data shuffled during joins and aggregations
Caching frequently accessed datasets in memory can avoid recomputation
Using efficient file formats like Parquet or ORC can reduce disk I/O and improve read performance
Q20. Write Pyspark code to read csv file and show top 10 records.
Pyspark code to read csv file and show top 10 records.
Import the necessary libraries
Create a SparkSession
Read the CSV file using the SparkSession
Display the top 10 records using the show() method
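A sketch with a placeholder file path; the header and inferSchema options depend on the actual file:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-csv").getOrCreate()

df = spark.read.csv("/path/to/file.csv", header=True, inferSchema=True)
df.show(10)   # top 10 records
```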
Q21. Write a sql query to find the name of person who logged in last within each country from Person Table ?
SQL query to find the name of person who logged in last within each country from Person Table
Use a subquery to find the max login time for each country
Join the Person table with the subquery on country and login time to get the name of the person
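A hedged example using Spark SQL; table and column names are assumed since the schema isn't given:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("last-login").getOrCreate()

person = spark.createDataFrame(
    [("Asha", "IN", "2024-01-03 10:00:00"),
     ("Ravi", "IN", "2024-01-05 09:30:00"),
     ("John", "US", "2024-01-04 22:15:00")],
    ["name", "country", "login_time"])
person.createOrReplaceTempView("Person")

spark.sql("""
    SELECT p.name, p.country
    FROM Person p
    JOIN (SELECT country, MAX(login_time) AS last_login
          FROM Person GROUP BY country) m
      ON p.country = m.country AND p.login_time = m.last_login
""").show()   # Ravi (IN), John (US)
```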
Create an API so that the message is received at the other end of the provided URL.
Q23. How to import data from an RDBMS via Sqoop without a primary key
Use the --split-by option in Sqoop to import data from an RDBMS without a primary key
Use --split-by option to specify a column to split the import into multiple mappers
Use --boundary-query option to specify a query to determine the range of values for --split-by column
Example: sqoop import --connect jdbc:mysql://localhost/mydb --username root --password password --table mytable --split-by id
Example: sqoop import --connect jdbc:mysql://localhost/mydb --username root --password password ...read more
Q24. In a word count spark program which command will run on driver and which will run on executor
Commands that run on driver and executor in a word count Spark program.
The command to read the input file and create RDD will run on driver.
The command to split the lines and count the words will run on executor.
The command to aggregate the word counts and write the output will run on driver.
Driver sends tasks to executors and coordinates the overall job.
Executor processes the tasks assigned by the driver.
Q25. What are the optimization techniques applied in pyspark code?
Optimization techniques in PySpark code include partitioning, caching, and using broadcast variables.
Partitioning data based on key columns to optimize join operations
Caching frequently accessed data in memory to avoid recomputation
Using broadcast variables to efficiently share small data across nodes
Using appropriate data types and avoiding unnecessary type conversions
Avoiding shuffling of data by using appropriate transformations and actions
Using appropriate data structures...read more
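As one concrete illustration of the broadcast point, a small dimension table can be broadcast so the join avoids a shuffle (data and names below are made up):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()

orders = spark.createDataFrame(
    [(1, "IN", 250.0), (2, "US", 90.0)], ["order_id", "country", "amount"])
countries = spark.createDataFrame(
    [("IN", "India"), ("US", "United States")], ["country", "country_name"])

# broadcast the small table to every executor instead of shuffling the large one
result = orders.join(broadcast(countries), on="country")
result.show()
```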
Q26. Difference between Coalesce and Repartition and In which case we are using it ?
Coalesce is used to combine multiple small partitions into a larger one, while Repartition is used to increase or decrease the number of partitions in a DataFrame.
Coalesce reduces the number of partitions in a DataFrame by combining small partitions into larger ones.
Repartition increases or decreases the number of partitions in a DataFrame by shuffling the data across partitions.
Coalesce is more efficient than Repartition as it minimizes data movement.
Coalesce is typically us...read more
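A quick illustration of the difference:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()

df = spark.range(0, 1_000_000)
df200 = df.repartition(200)     # full shuffle: redistributes rows evenly across 200 partitions
df10 = df200.coalesce(10)       # merges existing partitions without a full shuffle

print(df200.rdd.getNumPartitions())   # 200
print(df10.rdd.getNumPartitions())    # 10
```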
Q27. Write pyspark code to change column name, divide one column by another column.
Pyspark code to change column name and divide one column by another column.
Use 'withColumnRenamed' method to change column name
Use 'withColumn' method to divide one column by another column
Example: df = df.withColumnRenamed('old_col_name', 'new_col_name')
Example: df = df.withColumn('ratio', df['col1'] / df['col2'])
Q28. 1. Difference between shallow copy and deep copy 2. How can you merge two dataframes with different column names 3. Regex question to find all the characters and numbers of particular length size 4. Spark funda...
1. Shallow copy creates a new object but does not duplicate nested objects; deep copy creates a new object and duplicates all nested objects.
2. Merging dataframes with different column names requires renaming columns.
3. Regex can be used to find characters and numbers of a specific length.
4. Spark fundamentals involve understanding distributed computing and data processing.
Shallow copy: new object with same references to nested objects. Deep copy: new object with duplicate...read more
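A short illustration of point 1 using Python's copy module:

```python
import copy

original = {"ids": [1, 2, 3]}

shallow = copy.copy(original)       # new dict, but the inner list is shared
deep = copy.deepcopy(original)      # new dict and a new inner list

original["ids"].append(4)
print(shallow["ids"])               # [1, 2, 3, 4]  - shared nested object changed
print(deep["ids"])                  # [1, 2, 3]     - deep copy is unaffected
```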
Q29. Difference between Rank , Dense Rank and Row Number and when we are using each of them ?
RANK assigns the same rank to tied rows and leaves gaps after ties, DENSE_RANK assigns the same rank to tied rows without gaps, and ROW_NUMBER assigns a unique sequential number to every row.
Rank assigns the same rank to rows with the same value, leaving gaps in the ranking if there are ties.
Dense Rank assigns a unique rank to each distinct row, leaving no gaps in the ranking.
Row Number assigns a unique number to each row, without any regard for the values in the rows.
Rank is used when you want to see the ranking of eac...read more
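A PySpark window example with made-up data showing the three functions side by side:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ranking-demo").getOrCreate()

scores = spark.createDataFrame(
    [("Asha", 90), ("Ravi", 90), ("Meera", 80), ("Kiran", 70)], ["name", "score"])

w = Window.orderBy(F.desc("score"))
scores.select(
    "name", "score",
    F.rank().over(w).alias("rank"),              # 1, 1, 3, 4  (gap after the tie)
    F.dense_rank().over(w).alias("dense_rank"),  # 1, 1, 2, 3  (no gap)
    F.row_number().over(w).alias("row_number"),  # 1, 2, 3, 4  (always unique)
).show()
```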
Q30. What happens when we enforce the schema and when we manually define the schema in the code ?
Enforcing the schema ensures data consistency and validation, while manually defining the schema in code allows for more flexibility and customization.
Enforcing the schema ensures that all data conforms to a predefined structure and format, preventing errors and inconsistencies.
Manually defining the schema in code allows for more flexibility in handling different data types and structures.
Enforcing the schema can be done using tools like Apache Avro or Apache Parquet, while m...read more
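A sketch of manually defining a schema in PySpark; the path and column names are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("schema-demo").getOrCreate()

# explicit schema: no inference pass over the data, and malformed values surface early
schema = StructType([
    StructField("customer_id", IntegerType(), nullable=True),
    StructField("customer_name", StringType(), nullable=True),
])

df = spark.read.csv("/path/to/customers.csv", header=True, schema=schema)
df.printSchema()
```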
Q31. Difference between cache and persist, repartition and coalesce.
Cache and persist are used to store data in memory. Repartition and coalesce are used to change the number of partitions.
Cache stores the data in memory for faster access while persist allows the user to choose the storage level.
Repartition increases the number of partitions while coalesce decreases the number of partitions.
Cache and persist are lazy markers that take effect when an action runs; repartition and coalesce are transformations, not actions.
Cache and persist are used for iterative algorithms while repartition and...read more
Q32. 5) How to create a kafka topic with replication factor 2
To create a Kafka topic with replication factor 2, use the command line tool or Kafka API.
Use the command line tool 'kafka-topics.sh' with the '--replication-factor' flag set to 2.
Alternatively, use the Kafka API to create a topic with a replication factor of 2.
Ensure that the number of brokers in the Kafka cluster is greater than or equal to the replication factor.
Consider setting the 'min.insync.replicas' configuration property to 2 to ensure that at least two replicas are ...read more
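A hedged sketch using the kafka-python admin client (assuming that client is available; the CLI equivalent is kafka-topics.sh with --replication-factor 2):

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# replication factor 2 requires at least 2 brokers in the cluster
topic = NewTopic(name="my-topic", num_partitions=3, replication_factor=2)
admin.create_topics(new_topics=[topic])
```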
Q33. If you want very low latency - which is better, standalone or client mode?
Client mode is better for very low latency due to direct communication with the cluster.
Client mode allows direct communication with the cluster, reducing latency.
Standalone mode requires an additional layer of communication, increasing latency.
Client mode is preferred for real-time applications where low latency is crucial.
Q34. How to add a column in dataframe ? How to rename the column in dataframe ?
To add a column in a dataframe, use the 'withColumn' method. To rename a column, use the 'withColumnRenamed' method.
To add a column, use the 'withColumn' method with the new column name and the expression to compute the values for that column.
Example: df.withColumn('new_column', df['existing_column'] * 2)
To rename a column, use the 'withColumnRenamed' method with the current column name and the new column name.
Example: df.withColumnRenamed('old_column', 'new_column')
Q35. When a spark job is submitted, what happens at backend. Explain the flow.
When a spark job is submitted, various steps are executed at the backend to process the job.
The job is submitted to the Spark driver program.
The driver program communicates with the cluster manager to request resources.
The cluster manager allocates resources (CPU, memory) to the job.
The driver program creates DAG (Directed Acyclic Graph) of the job stages and tasks.
Tasks are then scheduled and executed on worker nodes in the cluster.
Intermediate results are stored in memory o...read more
Q36. How many stages will create from the above code that I have written
The number of stages created from the code provided depends on the specific code and its functionality.
The number of stages can vary based on the complexity of the code and the specific tasks being performed.
Stages may include data extraction, transformation, loading, and processing.
It is important to analyze the code and identify distinct stages to determine the total number.
Q37. 1. What is udf in Spark? 2. Write PySpark code to check the validity of mobile_number column
UDF stands for User-Defined Function in Spark. It allows users to define their own functions to process data.
UDFs can be written in different programming languages like Python, Scala, and Java.
UDFs can be used to perform complex operations on data that are not available in built-in functions.
PySpark code to check the validity of mobile_number column can be written using regular expressions and the `regexp_extract` function.
Example: `df.select('mobile_number', regexp_extract('...read more
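A runnable sketch of the validity check using rlike rather than a UDF; the "10 digits starting with 6-9" rule is an assumption about what counts as valid:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("mobile-validation").getOrCreate()

df = spark.createDataFrame(
    [("9876543210",), ("12345",), ("98765abc10",)], ["mobile_number"])

# assumption: a valid number is exactly 10 digits and starts with 6-9
df = df.withColumn("is_valid", F.col("mobile_number").rlike(r"^[6-9][0-9]{9}$"))
df.show()
```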
Q38. Elaboration of Spark optimization techniques. Types of transformations, shuffling.
Spark optimization techniques include partitioning, caching, and using appropriate transformations.
Partitioning data can improve performance by reducing shuffling.
Caching frequently used data can reduce the need for recomputation.
Transformations like filter, map, and reduceByKey can be used to optimize data processing.
Shuffling can be minimized by using operations like reduceByKey instead of groupByKey.
Broadcasting small data can improve performance by reducing network traffi...read more
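A small word-count illustration of preferring reduceByKey over groupByKey:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-demo").getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["spark", "hive", "spark", "kafka", "spark"])
pairs = words.map(lambda w: (w, 1))

# reduceByKey combines values on each partition before shuffling,
# so far less data crosses the network than with groupByKey
counts = pairs.reduceByKey(lambda a, b: a + b)
print(counts.collect())   # [('spark', 3), ('hive', 1), ('kafka', 1)] (order may vary)
```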
Q39. What are internal and external tables in Hive?
Internal tables store data within Hive's warehouse directory while external tables store data outside of it.
Internal tables are managed by Hive and are deleted when the table is dropped
External tables are not managed by Hive and data is not deleted when the table is dropped
Internal tables are faster for querying as data is stored within Hive's warehouse directory
External tables are useful for sharing data between different systems
Example: CREATE TABLE my_table (col1 INT, col2...read more
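A hedged illustration via Spark SQL (assumes a configured Hive metastore; table names and location are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-tables").enableHiveSupport().getOrCreate()

# managed (internal) table: Hive owns both the metadata and the data files
spark.sql("CREATE TABLE IF NOT EXISTS my_table (col1 INT, col2 STRING)")

# external table: dropping it removes only the metadata, the files stay in place
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS my_ext_table (col1 INT, col2 STRING)
    LOCATION '/data/my_ext_table/'
""")
```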
Q40. How do you do performance optimization in Spark. Tell how you did it in you project.
Performance optimization in Spark involves tuning configurations, optimizing code, and utilizing caching.
Tune Spark configurations such as executor memory, number of executors, and shuffle partitions.
Optimize code by reducing unnecessary shuffles, using efficient transformations, and avoiding unnecessary data movements.
Utilize caching to store intermediate results in memory and avoid recomputation.
Example: In my project, I optimized Spark performance by increasing executor me...read more
Q41. What is the difference between DBMS and RDBMS?
DBMS is a software system to manage databases while RDBMS is a type of DBMS that stores data in a structured manner.
DBMS stands for Database Management System while RDBMS stands for Relational Database Management System.
DBMS can manage any type of database while RDBMS manages only relational databases.
DBMS does not enforce any specific data model while RDBMS enforces the relational data model.
Examples of DBMS include MongoDB and Cassandra while examples of RDBMS include MySQL...read more
Q42. What is data flow? Difference with ADF pipeline and data flow
Data flow is a visual representation of data movement and transformation. ADF pipeline is a set of activities to move and transform data.
Data flow is a drag-and-drop interface to design data transformation logic
ADF pipeline is a set of activities to orchestrate data movement and transformation
Data flow is more flexible and powerful than ADF pipeline
Data flow can be used to transform data within a pipeline or as a standalone entity
Q43. What is Data Lake? Difference between data lake and data warehouse
Data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed.
Data lake stores raw, unstructured data from various sources.
Data lake allows for storing large amounts of data without the need for a predefined schema.
Data lake is cost-effective for storing data that may not have a clear use case at the time of storage.
Data warehouse stores structured data for querying and analysis.
Data warehouse requires a predefined schema for d...read more
Q44. What will happen if a job has failed in the pipeline and the data processing cycle is over?
If a job fails in the pipeline and data processing cycle is over, it can lead to incomplete or inaccurate data.
Incomplete data may affect downstream processes and analysis
Data quality may be compromised if errors are not addressed
Monitoring and alerting systems should be in place to detect and handle failures
Re-running the failed job or implementing error handling mechanisms can help prevent issues in the future
Q45. Introduction. Project flow. Why did you use HBase in your project? How did you query for data in HBase? What was the purpose of Hive? What are external partitioned tables? Optimization done in your projects.
Discussion on project flow, HBase, Hive, external partitioned tables, and optimization in a Data Engineer interview.
Explained project flow and the reason for using HBase in the project
Discussed querying data in HBase and the purpose of Hive
Described external partitioned tables and optimization techniques used in the project
Q46. Merge two unsorted lists such that the output list is sorted. You are free to use inbuilt sorting functions to sort the input lists
Merge two unsorted lists into a sorted list using inbuilt sorting functions.
Use inbuilt sorting functions to sort the input lists
Merge the sorted lists using a merge algorithm
Return the merged and sorted list
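A minimal sketch following the steps above:

```python
import heapq

def merge_unsorted(list1, list2):
    """Sort each input list, then merge the two sorted lists into one."""
    a, b = sorted(list1), sorted(list2)   # inbuilt sort is allowed per the problem
    return list(heapq.merge(a, b))        # linear, two-pointer style merge

print(merge_unsorted([5, 1, 4], [3, 2]))  # [1, 2, 3, 4, 5]
```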
Q47. Write a python program to convert a number to words. For ex: i/p 123, o/p - One hundred twenty three
Python program to convert a number to words.
Use a dictionary to map numbers to words.
Divide the number into groups of three digits and convert each group to words.
Handle special cases like zero, negative numbers, and numbers greater than or equal to one billion.
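A sketch limited to 0-999 to keep it short (matching the 123 example); extending it to thousands, millions, and billions follows the same three-digit grouping idea:

```python
ONES = ["", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
        "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
        "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]

def number_to_words(n):
    """Convert 0-999 to words, e.g. 123 -> 'One hundred twenty three'."""
    if n == 0:
        return "Zero"
    words = []
    if n >= 100:
        words.append(ONES[n // 100] + " hundred")
        n %= 100
    if n >= 20:
        words.append(TENS[n // 10])
        n %= 10
    if n > 0:
        words.append(ONES[n])
    return " ".join(words).capitalize()

print(number_to_words(123))  # One hundred twenty three
```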
Q48. Give an example of how to treat different categorised values based on their frequency.
Treating categorised values based on frequency involves grouping rare values together.
Identify rare values based on their frequency distribution
Group rare values together to reduce complexity
Consider creating a separate category for rare values
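A small pandas illustration; the "fewer than 2 occurrences" threshold is an arbitrary assumption:

```python
import pandas as pd

s = pd.Series(["visa", "visa", "mastercard", "visa", "rupay", "amex", "mastercard"])

# categories seen fewer than 2 times are treated as rare and grouped together
freq = s.value_counts()
rare = freq[freq < 2].index
s_grouped = s.where(~s.isin(rare), other="other")

print(s_grouped.value_counts())   # rupay and amex collapse into 'other'
```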
Q49. Explanation of current project architecture, cloud services used in the project and the purpose of using them. Architecture of Spark, Hive.
Our project architecture uses Spark and Hive for data processing and storage respectively. We utilize AWS services such as S3, EMR, and Glue for scalability and cost-effectiveness.
Spark is used for distributed data processing and analysis
Hive is used for data warehousing and querying
AWS S3 is used for storing large amounts of data
AWS EMR is used for running Spark and Hive clusters
AWS Glue is used for ETL (Extract, Transform, Load) jobs
The purpose of using these services is to...read more
Q50. Why do we need a data warehouse, why can't we store in the normal transactional database.
Data warehouses are designed for analytical queries and reporting, while transactional databases are optimized for transactional processing.
Data warehouses are optimized for read-heavy workloads, allowing for complex queries and reporting.
Transactional databases are optimized for write-heavy workloads, ensuring data integrity and consistency.
Data warehouses often store historical data for analysis, while transactional databases focus on current data for operational purposes.
D...read more