Associate Data Engineer

30+ Associate Data Engineer Interview Questions and Answers

Updated 7 Nov 2024

Q1. DataStage: how will you remove the header and trailer from a sequential data file?

Ans.

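To remove the header and trailer from a sequential data file in DataStage:

Read the file with a Sequential File stage.
If the header is a single line of column names, set the stage's 'First Line is Column Names' option so that line is not read as data.
For multi-line headers or for trailers, use the stage's 'Filter' option to pass the file through an operating-system command first, for example sed '1d;$d' to drop the first and last lines.
Alternatively, read every line and use a Transformer stage constraint to reject rows that match the header or trailer pattern, such as a known record-type prefix.
Property names can differ between DataStage versions and between server and parallel jobs.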

Q2. How to delete duplicate rows in SQL

Ans.

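Deleting duplicate rows in SQL

SELECT DISTINCT returns unique rows but does not delete anything; to deduplicate with it, copy the distinct rows into a new table and swap it in for the original
Use a GROUP BY clause on the columns that define a duplicate and keep one row per group, for example the one with the MIN() or MAX() key
Use the ROW_NUMBER() function with PARTITION BY on the duplicate-defining columns to number the rows in each group, then delete every row whose number is greater than 1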
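
The ROW_NUMBER() approach can be sketched with Python's built-in sqlite3 module. The customers table and its columns are hypothetical, and window functions need SQLite 3.25 or newer, so treat this as an illustration rather than a production script.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO customers (name, email) VALUES (?, ?)",
    [
        ("Asha", "asha@example.com"),
        ("Ravi", "ravi@example.com"),
        ("Asha", "asha@example.com"),  # duplicate of the first row
    ],
)

# Number the rows inside each (name, email) group, then delete every row numbered above 1.
conn.execute("""
    DELETE FROM customers
    WHERE id IN (
        SELECT id FROM (
            SELECT id,
                   ROW_NUMBER() OVER (PARTITION BY name, email ORDER BY id) AS rn
            FROM customers
        )
        WHERE rn > 1
    )
""")

remaining = conn.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
```

Ordering by id inside the window keeps the earliest copy of each duplicate; a different ORDER BY would keep a different survivor.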

Q3. How would you recommend a customer migrate to the cloud?

Ans.

To guide a customer's migration to the cloud, assess their current infrastructure, plan the migration strategy, choose the right cloud provider, and ensure data security.

Assess the customer's current infrastructure and identify the applications and data that can be migrated to the cloud.
Plan the migration strategy by considering factors like cost, time, and resource requirements.
Choose the right cloud provider based on the customer's specific needs and requirements.
Ensure d...

Q4. How to find a process ID in Linux

Ans.

To find a process ID in Linux, use the command 'ps aux | grep <process_name>'

Open the terminal
Type 'ps aux' to list all running processes
Use 'grep <process_name>' to filter for the process you are looking for
The process ID (PID) is listed in the second column
Alternatively, 'pgrep <process_name>' prints the matching PIDs directly
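
To make the "second column" point concrete, here is a small Python sketch that pulls PIDs out of ps-style text. The sample output and the find_pids helper are made up for illustration; in a real script the text would come from running ps.

```python
def find_pids(ps_output, name):
    """Return the PIDs of processes whose command contains `name`."""
    pids = []
    for line in ps_output.splitlines()[1:]:   # skip the header row
        fields = line.split(None, 10)         # 11 columns; the last one is the full command
        if len(fields) == 11 and name in fields[10]:
            pids.append(int(fields[1]))       # the PID is the second column
    return pids

# Hypothetical 'ps aux' output
sample = """USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.1 168000 11000 ?        Ss   09:00   0:01 /sbin/init
etl       4321  1.2  2.0 900000 80000 ?        Sl   09:05   0:42 python etl_job.py
etl       4400  0.0  0.0   6000   700 pts/0    S+   09:10   0:00 grep python
"""

print(find_pids(sample, "etl_job"))
```

Note that the grep process itself shows up when filtering on "python", which is the same effect seen when piping ps into grep on the command line.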

Q5. Reading data from a .log file and extracting each column with a specific regex.

Ans.

Reading data from a .log file and extracting columns with a specific regex.

Use Python's built-in 're' module to define the regex pattern.
Open the .log file using Python's 'open' function.
Iterate through each line of the file and extract the desired columns using the regex pattern.
Store the extracted data in a data structure such as a list or dictionary.
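
The steps above might look like the following sketch. The log layout (date, time, level, message) and the names LOG_PATTERN and parse_log are assumptions chosen for the example; a real log will need its own pattern.

```python
import re

# Hypothetical log layout: "<date> <time> <LEVEL> <message>"
LOG_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) (?P<message>.*)$"
)

def parse_log(lines):
    """Return one dict of named columns for every line that matches the pattern."""
    rows = []
    for line in lines:
        match = LOG_PATTERN.match(line.rstrip("\n"))
        if match:
            rows.append(match.groupdict())
    return rows

sample_lines = [
    "2024-11-07 10:15:02 INFO Job started\n",
    "not a log line\n",
    "2024-11-07 10:15:09 ERROR Connection refused\n",
]
rows = parse_log(sample_lines)
```

Because parse_log accepts any iterable of lines, a file can be passed directly: `with open("app.log") as f: rows = parse_log(f)`.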

Q6. How would you kill a job in DataStage?

Ans.

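To kill a job in DataStage:

Stop the job manually from the Director client
Stop the job from the command line using the dsjob command (dsjob -stop <project> <job>)
If the job does not respond, kill its processes at the operating-system level
Deleting the job from the DataStage repository removes its definition rather than stopping a run in progress, so treat it as cleanup and not as a way to kill the job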

Q7. Real-time scenarios, for example: why is Amazon sending the wrong products to customers?

Ans.

Amazon may send wrong products due to various reasons.

Incorrect product labeling or packaging
Human error in picking and packing process
Technical glitches in inventory management system
Fraudulent activities by third-party sellers
Miscommunication between customer and seller
Inadequate quality control measures
Logistical issues during shipping and delivery

Q8. Program to find the second-highest number in a List, without sorting.

Ans.

Program to find the second-highest number in a List, without sorting.

Iterate through the list and keep track of the highest and second-highest numbers.
Compare each number with the highest and second-highest numbers and update accordingly.
Return the second-highest number.
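
A minimal single-pass sketch of that idea. It assumes "second-highest" means the second-largest distinct value and returns None when no such value exists; an interviewer may want duplicates handled differently.

```python
def second_highest(numbers):
    """Return the second-largest distinct value, or None if there is none."""
    highest = second = None
    for n in numbers:
        if highest is None or n > highest:
            # New maximum: the old maximum becomes the runner-up
            highest, second = n, highest
        elif n != highest and (second is None or n > second):
            second = n
    return second

print(second_highest([4, 9, 1, 9, 7]))
```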

Q9. How is an empty class created in Python?

Ans.

An empty class can be created in Python using the 'pass' keyword.

Use the 'class' keyword to define a class.
Add the class name and a colon after the 'class' keyword.
Use the 'pass' keyword to indicate an empty class body.
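
For example (the class name Empty is arbitrary):

```python
class Empty:
    pass  # placeholder statement so the class body is syntactically valid

obj = Empty()
obj.note = "attributes can still be attached to an instance later"
```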

Q10. What is a Binary Search Tree? How does it work?

Ans.

A Binary Search Tree is a binary tree in which each node has at most two children, every key in a node's left subtree is smaller than the node's key, and every key in its right subtree is greater.

Nodes are arranged in a hierarchical order.
Searching, insertion, and deletion take O(log n) time when the tree is balanced, and O(n) in the worst case when it degenerates into a chain.
In-order traversal of a BST gives the nodes in sorted order.
Example: 5 is the root node, 3 is its left child, and 7 is its right child. 2 is the left child of 3 and 4 is the right child of 3.
Example: S...
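
The tree from the first example (5 at the root, then 3, 7, 2, 4) can be built with a short sketch like this. It is an unbalanced BST that ignores duplicate keys, which is one common convention and not the only one.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Insert key and return the (possibly new) root; duplicates are ignored."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root

def contains(root, key):
    """Walk left or right at each node until the key is found or the path ends."""
    while root is not None and root.key != key:
        root = root.left if key < root.key else root.right
    return root is not None

def in_order(root):
    """Left subtree, node, right subtree: yields the keys in sorted order."""
    if root is None:
        return []
    return in_order(root.left) + [root.key] + in_order(root.right)

root = None
for key in [5, 3, 7, 2, 4]:
    root = insert(root, key)

print(in_order(root))
```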

Q11. Given a node value in a linked list, delete the node.

Ans.

To delete a node in a linked list, update the pointers of the previous node to skip the node to be deleted.

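Traverse the linked list to find the node to be deleted, keeping track of the node before it.
Update the pointer of the previous node to skip the node to be deleted; if the node is the head, the head itself moves to the next node.
In languages with manual memory management, free the memory allocated to the deleted node; in garbage-collected languages such as Python, dropping the last reference is enough.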
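
A sketch for a singly linked list, assuming the task is to delete the first node holding a given value. The ListNode class and helper names are illustrative.

```python
class ListNode:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def delete_value(head, value):
    """Remove the first node holding `value` and return the new head."""
    dummy = ListNode(None, head)        # sentinel, so deleting the head needs no special case
    prev = dummy
    while prev.next is not None:
        if prev.next.value == value:
            prev.next = prev.next.next  # bypass the node; Python's garbage collector reclaims it
            break
        prev = prev.next
    return dummy.next

def to_list(head):
    """Collect the values into a Python list for easy inspection."""
    out = []
    while head is not None:
        out.append(head.value)
        head = head.next
    return out

head = ListNode(1, ListNode(2, ListNode(3)))
head = delete_value(head, 2)
print(to_list(head))
```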

Q12. SORT BY vs ORDER BY vs CLUSTER BY vs DISTRIBUTE BY

Ans.

SORT BY, ORDER BY, CLUSTER BY, and DISTRIBUTE BY are Hive (and Spark SQL) clauses that control how query output is sorted and partitioned.

SORT BY sorts rows within each reducer only, in ascending or descending order on one or more columns, so the combined output is not guaranteed to be globally sorted.
ORDER BY sorts the entire result set in ascending or descending order on one or more columns. Because a total ordering funnels the data through a single reducer in Hive, it can be slow on large inputs.
DISTRIBUTE BY sends all rows with the same values in the given columns to the same reducer, without sorting them.
CLUSTER BY is shorthand for DISTRIBUTE BY and SORT BY on the same columns. It is used to improve query p...

Q13. What are you experienced with in Python, ML, DL?

Ans.

Experienced in Python for data manipulation, ML for predictive modeling, and DL for deep learning algorithms.

Proficient in Python for data manipulation and analysis
Familiar with machine learning algorithms for predictive modeling
Knowledgeable in deep learning algorithms for image recognition tasks

Q14. Which Python packages have you used?

Ans.

I have used Python packages like Pandas, NumPy, Matplotlib, and Scikit-learn for data manipulation, analysis, visualization, and machine learning tasks.

Pandas - for data manipulation and analysis
NumPy - for numerical computing
Matplotlib - for data visualization
Scikit-learn - for machine learning tasks

Q15. What are KPIs, and how do you monitor an application?

Ans.

KPIs are Key Performance Indicators used to measure the success of an application. They are typically tracked with monitoring tools and visualized on dashboards.

KPIs are specific metrics used to evaluate the performance of an application
Monitoring can be done using monitoring software like Nagios, Prometheus, or New Relic
Dashboards can be created to visualize KPIs in real-time
Examples of KPIs for an application could include response time, error rate, and throughput

Q16. Explain neural networks and backpropagation.

Ans.

Neural networks are a type of machine learning model loosely inspired by the human brain. Backpropagation is the algorithm used to compute the gradients that train them.

Neural networks are composed of interconnected nodes called neurons.
Each neuron takes inputs, applies weights to them, and passes the result through an activation function.
Backpropagation is used to adjust the weights of the neurons in a neural network during training.
It works by calculating the error between the predicted output and t...
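
As a deliberately tiny illustration, the sketch below trains a single linear neuron with gradient descent. With only one layer, backpropagation reduces to one chain-rule step, so this shows the forward pass, gradient computation, and weight update, but not the layer-by-layer propagation of a deep network. The function name, learning rate, and data are assumptions made for the example.

```python
def train_neuron(xs, ys, lr=0.1, epochs=500):
    """Fit y ~ w*x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Forward pass: prediction error for every sample
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Backward pass: gradients of the mean squared error via the chain rule
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        # Update step: move each parameter against its gradient
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Data generated from y = 2x + 1, so w should approach 2 and b should approach 1
w, b = train_neuron([0, 1, 2, 3], [1, 3, 5, 7])
```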

Q17. How well versed are you in Python, SQL?

Ans.

I am proficient in Python and SQL with experience in data manipulation and analysis.

Proficient in Python for data manipulation and analysis
Strong understanding of SQL for querying databases
Experience in writing complex SQL queries for data extraction and analysis

Q18. What was the most difficult subject in college?

Ans.

The most difficult subject in college was Advanced Calculus.

Advanced Calculus involved complex mathematical concepts and required a deep understanding of calculus principles.
The subject required a lot of practice and problem-solving skills to master the concepts.
Topics such as multivariable calculus, differential equations, and vector calculus were particularly challenging.
The abstract nature of the subject made it difficult to visualize and apply in real-world scenarios.

Q19. SQL queries to find second highest salary

Ans.

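Use a SQL query with ORDER BY and LIMIT, or with a subquery, to find the second-highest salary

Use an ORDER BY clause to sort salaries in descending order
Use LIMIT 1 OFFSET 1 to skip the first row and return the second
Select DISTINCT salaries, or use a subquery that takes MAX(salary) below the overall MAX(salary), so that several employees sharing the highest salary do not hide the second-highest value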
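
Both approaches can be tried with Python's sqlite3 module. The employees table and its data are hypothetical, and LIMIT/OFFSET syntax varies between databases (SQL Server, for instance, uses OFFSET ... FETCH).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("A", 900), ("B", 1200), ("C", 1200), ("D", 700)],  # two employees tie for the top
)

# Option 1: distinct salaries sorted descending, skipping the first
by_offset = conn.execute(
    "SELECT DISTINCT salary FROM employees ORDER BY salary DESC LIMIT 1 OFFSET 1"
).fetchone()[0]

# Option 2: the highest salary below the overall maximum
by_subquery = conn.execute(
    "SELECT MAX(salary) FROM employees "
    "WHERE salary < (SELECT MAX(salary) FROM employees)"
).fetchone()[0]
```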

Q20. What are the different types of joins?

Ans.

Different types of joins in SQL include inner join, left join, right join, and full outer join.

Inner join: Returns rows when there is a match in both tables
Left join: Returns all rows from the left table and the matched rows from the right table
Right join: Returns all rows from the right table and the matched rows from the left table
Full outer join: Returns all rows from both tables, matching them where possible and filling the unmatched side with NULLs
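
A small sqlite3 sketch of the inner and left joins, using made-up customers and orders tables. RIGHT and FULL OUTER JOIN are left out because SQLite only supports them from version 3.39 onward.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, item TEXT);
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi');
    INSERT INTO orders VALUES (1, 'book'), (3, 'pen');
""")

# Inner join: only customers that have a matching order
inner = conn.execute(
    "SELECT c.name, o.item FROM customers c "
    "INNER JOIN orders o ON o.customer_id = c.id ORDER BY c.id"
).fetchall()

# Left join: every customer, with NULL (None in Python) where no order matches
left = conn.execute(
    "SELECT c.name, o.item FROM customers c "
    "LEFT JOIN orders o ON o.customer_id = c.id ORDER BY c.id"
).fetchall()
```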

Q21. SQL queries with window functions

Ans.

SQL queries with window functions

Window functions perform calculations across a set of rows that are related to the current row
Common window functions include ROW_NUMBER, RANK, DENSE_RANK, and NTILE
Window functions are used with the OVER() clause to define the window or subset of rows to perform the calculation on
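
The difference between ROW_NUMBER, RANK, and DENSE_RANK shows up as soon as there is a tie. The sketch below uses sqlite3 with a hypothetical scores table and assumes SQLite 3.25 or newer, where window functions were introduced.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (player TEXT, points INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("A", 50), ("B", 40), ("C", 40), ("D", 30)],  # B and C tie on points
)

rows = conn.execute("""
    SELECT player,
           ROW_NUMBER() OVER (ORDER BY points DESC, player) AS row_num,
           RANK()       OVER (ORDER BY points DESC)         AS rnk,
           DENSE_RANK() OVER (ORDER BY points DESC)         AS dense_rnk
    FROM scores
    ORDER BY row_num
""").fetchall()
```

ROW_NUMBER always produces distinct numbers, RANK leaves a gap after a tie, and DENSE_RANK does not.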

Q22. Design a system for the stock market

Ans.

Design a system for the stock market

Define the scope of the system
Identify the data sources and types of data to be collected
Design a database schema to store the data
Develop algorithms for data analysis and prediction
Implement a user interface for traders to access the system
Ensure the system is secure and scalable

Q23. Program to reverse a linked list

Ans.

Program to reverse a linked list

Iterate through the linked list and change the direction of the pointers
Use three pointers to keep track of the current, previous and next nodes
Handle the edge cases of empty list and single node list
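
An iterative sketch of the three-pointer approach; the ListNode class and the to_list helper are illustrative names.

```python
class ListNode:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def reverse(head):
    """Reverse the list in place and return the new head."""
    prev, current = None, head
    while current is not None:
        nxt = current.next      # remember the rest of the list
        current.next = prev     # flip this node's pointer
        prev, current = current, nxt
    return prev                 # None for an empty list; the same node for a single-node list

def to_list(head):
    """Collect the values into a Python list for easy inspection."""
    out = []
    while head is not None:
        out.append(head.value)
        head = head.next
    return out

print(to_list(reverse(ListNode(1, ListNode(2, ListNode(3))))))
```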

Q24. How do you remove duplicates in SQL?

Ans.

Use the DISTINCT keyword in a SELECT statement to return only unique rows; to delete stored duplicates, number the rows with ROW_NUMBER() and delete the extras.

Use the DISTINCT keyword in a SELECT statement to retrieve unique rows
Use a GROUP BY clause on the duplicate-defining columns to collapse each group to one row; COUNT(*) shows how many copies each group had
Use the ROW_NUMBER() function with a PARTITION BY clause and delete the rows numbered above 1 to keep only one row per group

Q25. Frequency of a number in a list

Ans.

Calculate the frequency of a number in a list

Iterate through the list and count occurrences of the number
Use a dictionary to store the count of each number
Return the count of the specified number
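
Both variants in a short sketch: a manual loop for one number, and collections.Counter as the dictionary of counts for every number.

```python
from collections import Counter

def frequency(numbers, target):
    """Count occurrences of target with a single pass over the list."""
    count = 0
    for n in numbers:
        if n == target:
            count += 1
    return count

numbers = [3, 1, 3, 7, 3, 1]
counts = Counter(numbers)   # dictionary-like mapping: number -> occurrences

print(frequency(numbers, 3), counts[1])
```

The built-in `numbers.count(3)` gives the same result as the loop; the Counter is worth building when several different numbers will be looked up.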

Q26. Optimizations I can use

Ans.

Optimizations for data engineering

Use indexing to speed up queries
Partition data to improve query performance
Use caching to reduce data retrieval time
Optimize data storage format for faster processing
Use parallel processing to speed up data processing
Optimize network bandwidth usage
Use compression to reduce storage and network usage

Q27. SQL queries to find duplicate values

Ans.

Use SQL queries with GROUP BY and HAVING clause to find duplicate values in a table.

Use GROUP BY clause to group the records based on the columns you want to check for duplicates.
Use HAVING clause to keep only the groups that have more than one record, which are the duplicates.
Example: SELECT column_name, COUNT(*) FROM table_name GROUP BY column_name HAVING COUNT(*) > 1;

Q28. What is data warehousing?

Ans.

Data warehousing is the process of collecting, storing, and managing data from various sources for analysis and reporting.

Data warehousing involves extracting data from multiple sources and consolidating it into a central repository.
It is used for data analysis, reporting, and decision-making purposes.
Data warehouses are designed for query and analysis rather than transaction processing.
Examples of data warehousing tools include Amazon Redshift, Snowflake, and Google BigQuery...

Q29. Pillars of object-oriented programming

Ans.

Pillars of OOP are Inheritance, Encapsulation, Abstraction, and Polymorphism.

Inheritance allows a class to inherit properties and behaviors from another class.
Encapsulation restricts access to certain components of an object, protecting its integrity.
Abstraction hides complex implementation details and only shows necessary features.
Polymorphism allows objects to be treated as instances of their parent class while each subclass supplies its own behavior for the shared interface.
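
The four pillars can be seen together in a compact sketch. The Shape hierarchy is a made-up example, and Python enforces encapsulation only by convention (a leading underscore), unlike the private modifiers of Java or C++.

```python
from abc import ABC, abstractmethod

class Shape(ABC):                       # abstraction: callers rely only on area()
    @abstractmethod
    def area(self):
        ...

class Rectangle(Shape):                 # inheritance: a Rectangle is a Shape
    def __init__(self, width, height):
        self._width = width             # encapsulation by convention: "private" attributes
        self._height = height

    def area(self):
        return self._width * self._height

class Square(Rectangle):                # inherits area() from Rectangle unchanged
    def __init__(self, side):
        super().__init__(side, side)

# polymorphism: the same call works on any Shape
areas = [shape.area() for shape in (Rectangle(2, 3), Square(4))]
```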

Q30. What is a constructor?

Ans.

A constructor is a special method that is used to initialize objects of a class.

In Java and C++, a constructor has the same name as the class it belongs to; in Python, initialization is done in the __init__ method.
They are called automatically when an object is created.
They can take parameters to initialize the object's properties.
If a class does not define a constructor, a default (or inherited) one is used.
Java and C++ allow constructors to be overloaded to provide multiple ways to initialize objects; Python allows one __init__ per class, so default arguments or alternative class methods serve that purpose.
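
In Python the initializer is the __init__ method. The sketch below uses a hypothetical Employee class, with a default argument and a class method standing in for the overloaded constructors of Java or C++.

```python
class Employee:
    def __init__(self, name, salary=0):
        # __init__ runs automatically when Employee(...) is called
        self.name = name
        self.salary = salary

    @classmethod
    def from_csv(cls, line):
        """Alternative way to build an Employee, in place of an overloaded constructor."""
        name, salary = line.split(",")
        return cls(name, int(salary))

a = Employee("Asha")                  # salary falls back to the default
b = Employee.from_csv("Ravi,1200")
```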

Q31. Types of methods in Python?

Ans.

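Python classes have three types of methods: instance methods, class methods, and static methods.

Instance methods take the instance (self) as their first argument and are the default kind of method.
Class methods are marked with @classmethod and take the class (cls) as their first argument; they are often used as alternative constructors.
Static methods are marked with @staticmethod and receive neither the instance nor the class.
Built-ins such as print() and len() are functions, not methods; a method is a function defined on a class.
Methods are called on objects, such as strings or lists, using dot notation.
Methods can also take arguments, which are passed in parentheses after the method name.
Methods can return values using the return keyword.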
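
Within a class, Python distinguishes instance methods, class methods, and static methods. A minimal sketch with a made-up Temperature class:

```python
class Temperature:
    scale = "Celsius"

    def __init__(self, degrees):
        self.degrees = degrees

    def describe(self):                  # instance method: receives the instance as self
        return f"{self.degrees} {self.scale}"

    @classmethod
    def from_fahrenheit(cls, f):         # class method: receives the class as cls
        return cls((f - 32) * 5 / 9)

    @staticmethod
    def is_freezing(degrees):            # static method: receives neither
        return degrees <= 0

t = Temperature.from_fahrenheit(212)     # called on the class, returns an instance
```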

Q32. Difference between tables and views

Ans.

Tables store data in a structured format, while views are virtual tables that display data from one or more tables.

Tables store actual data, while views display data dynamically based on the underlying tables.
Tables can be modified directly; views are often read-only, although many databases allow updates through simple single-table views.
Views can combine data from multiple tables, while tables store data in a single structure.

Q33. Explain the different joins in SQL

Ans.

Different types of joins in SQL include inner join, full outer join, left (outer) join, and right (outer) join.

Inner join: Returns rows when there is a match in both tables
Full outer join: Returns all rows from both tables, filling in NULLs on the side that has no match
Left join: Returns all rows from the left table and the matched rows from the right table
Right join: Returns all rows from the right table and the matched rows from the left table

Q34. Stack and queue difference and use cases

Ans.

Stack and queue are data structures with different principles and use cases.

Stack is Last In First Out (LIFO) while queue is First In First Out (FIFO)
Stack is used for function calls, undo operations, and backtracking
Queue is used in BFS algorithms, printer queues, and messaging systems
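
In Python, a list works as a stack and collections.deque works as a queue, as this short sketch shows:

```python
from collections import deque

stack = []
for item in ("a", "b", "c"):
    stack.append(item)        # push onto the top
last_in = stack.pop()         # pop returns the most recently pushed item (LIFO)

queue = deque()
for item in ("a", "b", "c"):
    queue.append(item)        # enqueue at the back
first_in = queue.popleft()    # dequeue from the front (FIFO)
```

A deque is used for the queue because removing from the front of a plain list shifts every remaining element, while popleft does not.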

Q35. What is SCD type 2?

Ans.

SCD type 2 stands for slowly changing dimension type 2, a method used in data warehousing to track historical data changes.

SCD type 2 is used to maintain historical data by creating new records for changes in dimension attributes.
It involves adding a new row to the dimension table with a new surrogate key for each change.
The old record is marked as inactive with an end date, while the new record has a start date.
This method allows for tracking changes over time and analyzing ...
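
The row-versioning logic can be sketched in plain Python on an in-memory list. The column names (surrogate_key, customer_id, start_date, end_date, is_current) and the apply_scd2 helper are assumptions for illustration; in a warehouse this is normally done with a MERGE statement or an ETL tool.

```python
from datetime import date

def apply_scd2(dimension, natural_key, new_attrs, change_date):
    """Expire the current row for natural_key if it changed, then append a new version."""
    current = next(
        (r for r in dimension if r["customer_id"] == natural_key and r["is_current"]),
        None,
    )
    if current is not None:
        if all(current[k] == v for k, v in new_attrs.items()):
            return dimension                   # nothing changed, history stays as is
        current["end_date"] = change_date      # close the old version
        current["is_current"] = False
    dimension.append({
        "surrogate_key": len(dimension) + 1,   # simplistic key generator for the sketch
        "customer_id": natural_key,
        **new_attrs,
        "start_date": change_date,
        "end_date": None,
        "is_current": True,
    })
    return dimension

dim = []
apply_scd2(dim, 42, {"city": "Pune"}, date(2024, 1, 1))
apply_scd2(dim, 42, {"city": "Mumbai"}, date(2024, 6, 1))   # change: new row, old row expired
apply_scd2(dim, 42, {"city": "Mumbai"}, date(2024, 7, 1))   # no change: no new row
```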

Q36. Difference between DBMS and RDBMS?

Ans.

DBMS is a software system that manages databases, while RDBMS is a type of DBMS that stores data in a structured format using tables.

DBMS stands for Database Management System, while RDBMS stands for Relational Database Management System.
DBMS can manage any type of data, while RDBMS organizes data into tables with rows and columns.
RDBMS enforces relationships between tables using keys like primary and foreign keys.
Examples of DBMS include MongoDB, Oracle Database, and Microso...

Q37. Small file problem

Ans.

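The small file problem refers to having a very large number of small files in a storage system, most often discussed for HDFS.

Each file carries fixed metadata and per-task overhead, so many small files consume NameNode memory in HDFS and make storage and processing inefficient.
Solutions include consolidating small files into larger ones or using a different storage system.
Examples include packing files into Hadoop's SequenceFile or HAR (Hadoop Archive) formats and compacting small objects before writing them to Amazon S3.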

Q38. RDS VA DF VS DS

Ans.

RDS, VA, DF, VS, and DS are all acronyms related to data engineering.

RDS stands for Relational Database Service, a managed database service by AWS.
VA stands for Virtual Assistant, a software program that can assist with tasks.
DF stands for Dataflow, a managed service by Google Cloud for data processing.
VS stands for Virtual Server, a server that runs on a virtual machine.
DS stands for Datastore, a NoSQL document database by Google Cloud.

Q39. Describe CAP Theorem

Ans.

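The CAP theorem states that a distributed system can guarantee at most two of consistency, availability, and partition tolerance at the same time. Because network partitions cannot be ruled out, the practical choice during a partition is between consistency and availability.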

Consistency: all nodes see the same data at the same time
Availability: every request receives a response, without guarantee that it contains the most recent version of the information
Partition tolerance: system continues to function even when network partitions occur
Examples: Cassandra prioritizes availability and partition tolerance over consisten...