AWS Data Engineer
30+ AWS Data Engineer Interview Questions and Answers

Asked in Cognizant

Q. Which libraries did you use in your project, and how did you import them?
I have used various libraries in my project, such as Pandas, NumPy, and Matplotlib, and I import them using the import statement.
I have used Pandas for data manipulation and analysis
I have used NumPy for numerical operations
I have used Matplotlib for data visualization
I import them using the import statement followed by the library name, as shown below
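A minimal sketch of those imports and a tiny usage example:

```python
import pandas as pd              # data manipulation and analysis
import numpy as np               # numerical operations
import matplotlib.pyplot as plt  # data visualization

df = pd.DataFrame({"x": np.arange(5), "y": np.arange(5) ** 2})
df.plot(x="x", y="y")  # quick line chart of the data
plt.show()
```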

Asked in TCS

Q. Explain the use of * and ** with args and kwargs, exception handling, and the use of the pass statement in Python.
The * and ** symbols are used to pass a variable number of positional and keyword arguments to Python functions.
The * symbol before a parameter name in a function definition allows it to accept a variable number of positional arguments as a tuple.
The ** symbol before a parameter name in a function definition allows it to accept a variable number of keyword arguments as a dictionary.
The * symbol before an iterable object in a function call unpacks it into separate arguments.
The ** symbol before a dictionary in a function call unpacks it into separate keyword arguments, as in the sketch below.
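A small sketch covering all three parts of the question (the function and values are illustrative):

```python
# *args collects extra positional arguments as a tuple,
# **kwargs collects extra keyword arguments as a dict
def report(*args, **kwargs):
    for value in args:
        print("positional:", value)
    for key, value in kwargs.items():
        print("keyword:", key, "=", value)

# * and ** also unpack an iterable / dict at the call site
report(*[1, 2], **{"env": "dev"})

# Exception handling, with the pass statement as a placeholder
try:
    1 / 0
except ZeroDivisionError:
    pass  # deliberately ignore the error
```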
AWS Data Engineer Interview Questions and Answers for Freshers

Asked in LTIMindtree

Q. Modify null salaries with the average salary and find the count of employees by joining date. What configurations does a Glue job need? What are connectors and data connections in the Glue service?
Use Glue job to modify null salaries with average salary and find count of employees by joining date.
Create a Glue job to read data, modify null salaries with average salary, and count employees by joining date
Use Glue connectors to connect to data sources like S3, RDS, or Redshift
Data connections in Glue service are used to define the connection information to data sources
Example: Use a Glue job to read employee data from S3, calculate the average salary, replace null values, and count employees by joining date, as sketched below.
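A minimal PySpark sketch of that transformation logic (the S3 path and the salary and joining_date column names are hypothetical; inside an actual Glue job you would typically read through the GlueContext instead):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://my-bucket/employees/")  # hypothetical source

# Replace null salaries with the overall average salary
avg_salary = df.select(F.avg("salary")).first()[0]
df_filled = df.fillna({"salary": avg_salary})

# Count of employees per joining date
df_filled.groupBy("joining_date").count().show()
```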

Asked in HARMAN

Q. Write a program to find the greatest element from a list of elements within a window range of size 3.
Program to find the greatest element in a window of size 3
Iterate through the list and maintain a window of size 3
Find the maximum element in each window and store it in a separate list
Return the list of maximum elements
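A straightforward sketch of that approach:

```python
def window_max(nums, k=3):
    # Slide a window of size k across the list and take each window's max
    return [max(nums[i:i + k]) for i in range(len(nums) - k + 1)]

print(window_max([3, 1, 4, 1, 5, 9, 2]))  # [4, 4, 5, 9, 9]
```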

Asked in EXL Service

Q. What is DevOps? How is Python used in it? What can be done for a 3-tier architecture?
DevOps is a software development approach that emphasizes collaboration, automation, and monitoring throughout the software development lifecycle.
DevOps is a combination of development and operations that aims to improve the speed and quality of software delivery.
Python is a popular language for DevOps tasks such as automation, configuration management, and testing.
In a 3-tier architecture, the presentation, application, and data layers are separated to improve scalability, flexibility, and maintainability.

Asked in TCS

Q. Given two tables, explain the different types of joins and how to determine the total count of the resulting table.
Explanation of join types and the row count of the resulting table
Joins combine data from two or more tables based on a common column
Inner join returns only the matching rows from both tables
Left join returns all rows from the left table and matching rows from the right table
Right join returns all rows from the right table and matching rows from the left table
Full outer join returns all rows from both tables
Total count is the number of rows in the resulting joined table
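One way to see how the join type drives the resulting row count, sketched with pandas on toy tables:

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4], "dept": ["x", "y", "z"]})

for how in ["inner", "left", "right", "outer"]:
    joined = left.merge(right, on="id", how=how)
    print(how, "->", len(joined), "rows")
# inner -> 2, left -> 3, right -> 3, outer -> 4
```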

Asked in Fractal Analytics

Q. How do you extract the first letter of the first name from a column using a data manipulation package supported on Databricks?
The function to extract the first letter of the firstname in a column varies based on the data manipulation package used.
Use the SUBSTR function in SQL
Use the str_extract function in R
On Databricks, use the substring function from PySpark, as sketched below
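On Databricks the usual data manipulation package is PySpark; a minimal sketch (the firstname column is assumed):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Alice",), ("Bob",)], ["firstname"])

# substring(col, pos, len) is 1-indexed in Spark SQL
df.withColumn("first_letter", F.substring(F.col("firstname"), 1, 1)).show()
```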

Asked in Cognizant

Q. How do you parse JSON messages using Python?
Python provides built-in libraries to parse JSON messages easily.
Use the json module to load and parse JSON data
Access the data using keys and indexes
Use loops to iterate through JSON arrays
Use the dumps() method to convert Python objects back to a JSON string
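A minimal sketch of those steps with a sample message:

```python
import json

message = '{"user": {"name": "Alice"}, "tags": ["etl", "aws"]}'
data = json.loads(message)           # parse the JSON string into Python objects

print(data["user"]["name"])          # access values by key
for tag in data["tags"]:             # iterate through a JSON array
    print(tag)

print(json.dumps(data, indent=2))    # convert back to a JSON-formatted string
```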

Asked in Photon Interactive

Q. Explain how you handle large data processing in PySpark.
Large data processing in PySpark involves partitioning, caching, and optimizing transformations for efficient processing.
Partitioning data to distribute workload evenly across nodes
Caching intermediate results to avoid recomputation
Optimizing transformations to minimize shuffling and reduce data movement
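A short PySpark sketch of those ideas (the source path and column names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://my-bucket/events/")  # hypothetical source

df = df.repartition(200, "event_date")  # spread the workload across nodes
df.cache()                              # keep reused data in memory

# Filter early to reduce the data shuffled by the aggregation below
df.filter(df.status == "ok").groupBy("event_date").count().show()
```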

Asked in Photon Interactive

Q. Explain how you implement data governance in your company.
Data governance is implemented through policies, processes, and tools to ensure data quality, security, and compliance.
Establish data governance policies and procedures to define roles, responsibilities, and processes for managing data
Implement data quality controls to ensure accuracy, completeness, and consistency of data
Utilize data security measures such as encryption, access controls, and monitoring to protect sensitive data
Enforce compliance with regulations and standards governing how data is stored and used

Asked in Cognizant

Q. How do you print the current time in Python?
To print the current time in Python, use the datetime module.
Import the datetime module
Use the now() method to get the current date and time
Use the strftime() method to format the time as a string
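For example:

```python
from datetime import datetime

now = datetime.now()                      # current date and time
print(now.strftime("%Y-%m-%d %H:%M:%S"))  # format the time as a string
```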

Asked in LexisNexis

Q. Write SQL code for dense rank and Python code for removing duplicates.
SQL dense rank and Python duplicate removal
For SQL dense rank, use the DENSE_RANK() function to assign a unique rank to each row within a partition
For Python removing duplicates, use the set() function to remove duplicates from a list or use pandas library to drop duplicates from a DataFrame
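A small sketch of the Python side, including a pandas analogue of DENSE_RANK():

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c", "b"], "salary": [300, 200, 300, 200]})

# pandas analogue of SQL's DENSE_RANK() OVER (ORDER BY salary DESC)
df["rank"] = df["salary"].rank(method="dense", ascending=False).astype(int)

print(df.drop_duplicates())        # remove duplicate rows from a DataFrame
print(list(set([1, 2, 2, 3])))     # remove duplicates from a plain list
```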

Asked in TCS

Q. What is the AWS Glue service, and what are its components?
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data for analytics.
AWS Glue Data Catalog: A central metadata repository that stores metadata information about datasets and sources.
AWS Glue ETL: Allows you to create ETL jobs to transform data from various sources into a format that can be easily analyzed.
AWS Glue Crawler: Automatically discovers data in various sources, infers the schema, and updates the AWS Glue Data Catalog with table definitions.
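Those components can also be driven programmatically through boto3; a sketch assuming a database, job, and crawler that already exist:

```python
import boto3

glue = boto3.client("glue")

# Data Catalog: list tables in a (hypothetical) catalog database
tables = glue.get_tables(DatabaseName="my_database")

# ETL: start an existing (hypothetical) Glue job
run = glue.start_job_run(JobName="my_etl_job")

# Crawler: refresh the catalog via a (hypothetical) crawler
glue.start_crawler(Name="my_crawler")
```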

Asked in Cognizant

Q. How does AWS Glue work?
AWS Glue is a fully managed ETL service that makes it easy to move data between data stores.
AWS Glue crawls your data sources and constructs a data catalog using metadata.
It then generates ETL code to transform and move data between data stores.
AWS Glue supports various data sources like Amazon S3, JDBC, Amazon RDS, etc.
It also provides a serverless environment to run your ETL jobs.
AWS Glue integrates with other AWS services like Amazon Athena, Amazon EMR, etc.

Asked in TCS

Q. What is AWS and its architecture?
AWS is a cloud computing platform provided by Amazon with a scalable and flexible architecture.
AWS stands for Amazon Web Services
It offers a wide range of cloud services including computing power, storage, and databases
AWS architecture is designed to be scalable, secure, and cost-effective
Examples of AWS services include EC2 for virtual servers, S3 for storage, and RDS for databases

Asked in Coditas Technologies

Q. Write a Python code snippet to find repeated numbers in a list and their frequency.
Python code to find repeated numbers in a list and their frequency
Iterate through the list and use a dictionary to store the count of each number
Iterate through the dictionary to find numbers with a count greater than 1 and report their frequency
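A compact sketch using collections.Counter, which implements the same counting idea:

```python
from collections import Counter

def repeated_with_frequency(nums):
    # Count every number, then keep the ones that appear more than once
    counts = Counter(nums)
    return {n: c for n, c in counts.items() if c > 1}

print(repeated_with_frequency([1, 2, 2, 3, 3, 3]))  # {2: 2, 3: 3}
```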

Asked in TCS

Q. What is the equivalent of Data Factory in AWS?
AWS Glue is the equivalent of Data Factory in AWS, providing ETL capabilities for data integration and transformation.
AWS Glue is a fully managed ETL (Extract, Transform, Load) service.
It allows users to prepare and transform data for analytics.
Glue can automatically discover and categorize data using its Data Catalog.
It supports various data sources like S3, RDS, and Redshift.
Glue jobs can be triggered on-demand or on a schedule.

Asked in Tech Mahindra

Q. Explain how Lambda asynchronous invocation works.
Lambda asynchronous invocation allows decoupling of services by triggering functions without waiting for a response.
Lambda function is invoked by an event source like S3, SNS, or API Gateway.
The event source sends the event to Lambda, which queues the event for processing.
Lambda scales automatically to handle the incoming events concurrently.
The function processes the event and can send the result to another service or store it in a database.
Example: An S3 bucket triggers a Lambda function asynchronously when a new object is uploaded, as in the sketch below.
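A sketch of invoking a function asynchronously with boto3 (the function name and payload are hypothetical):

```python
import json
import boto3

client = boto3.client("lambda")

# InvocationType='Event' queues the event and returns immediately
response = client.invoke(
    FunctionName="process-upload",
    InvocationType="Event",
    Payload=json.dumps({"bucket": "my-bucket", "key": "file.csv"}),
)
print(response["StatusCode"])  # 202 means the event was accepted
```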

Asked in Cognizant

Q. What is AWS Kinesis?
AWS Kinesis is a managed service that enables real-time processing of streaming data at scale.
Kinesis can handle large amounts of data in real-time from various sources such as IoT devices, social media, and logs.
It allows data to be processed in real-time using AWS Lambda, Kinesis Analytics, or Kinesis Data Firehose.
Kinesis can be used for various use cases such as real-time analytics, machine learning, and fraud detection.
Kinesis provides various features such as data encryption, monitoring, and shard-based scaling.
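A minimal producer sketch with boto3 (the stream name and payload are hypothetical):

```python
import json
import boto3

kinesis = boto3.client("kinesis")

kinesis.put_record(
    StreamName="clickstream",
    Data=json.dumps({"user_id": 42, "event": "page_view"}).encode("utf-8"),
    PartitionKey="42",  # records with the same key land on the same shard
)
```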

Asked in HARMAN

Q. Write PySpark code to create a DataFrame from multiple lists.
Creating a DataFrame from multiple lists using PySpark code.
Import necessary libraries like pyspark.sql.
Create lists of data.
Create a SparkSession.
Convert lists to RDDs and then to a DataFrame.
Display the DataFrame.
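A minimal version of those steps (sample lists assumed; createDataFrame accepts a list of tuples directly, so an explicit RDD step is optional):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

names = ["Alice", "Bob", "Carol"]
ages = [34, 29, 41]

# zip the lists into rows, then name the columns
df = spark.createDataFrame(list(zip(names, ages)), ["name", "age"])
df.show()
```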

Asked in TCS

Q. What are a Data Lake and a Data Warehouse?
Data Lake is a storage repository that holds a vast amount of raw data in its native format, while Data Warehouse is a structured repository for processed and analyzed data.
Data Lake stores raw, unstructured data in its original form for future processing and analysis.
Data Warehouse stores structured, processed data for querying and reporting.
Data Lake is ideal for big data analytics and machine learning applications.
Data Warehouse is optimized for complex queries and business intelligence reporting.

Asked in Coditas Technologies

Q. Create a DAG to execute Python scripts in parallel.
Create a Directed Acyclic Graph (DAG) to execute Python scripts in parallel.
Define tasks for each Python script to be executed in parallel
Use a parallelism parameter to specify the number of tasks to run concurrently
Set up dependencies between tasks to ensure proper execution order
Use a DAG scheduler like Apache Airflow to manage and execute the DAG
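A minimal Airflow sketch of that shape (the script paths and DAG id are hypothetical):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("parallel_scripts", start_date=datetime(2024, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    extract_a = BashOperator(task_id="extract_a",
                             bash_command="python /scripts/extract_a.py")
    extract_b = BashOperator(task_id="extract_b",
                             bash_command="python /scripts/extract_b.py")
    report = BashOperator(task_id="report",
                          bash_command="python /scripts/report.py")

    # extract_a and extract_b run in parallel; report waits for both
    [extract_a, extract_b] >> report
```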

Asked in Cognizant

Q. What are the differences between JSON and Parquet file formats?
JSON is a text-based format for data interchange, while Parquet is a columnar storage format optimized for performance and efficiency.
JSON is human-readable, while Parquet is binary and more efficient for storage.
Parquet supports complex data types and nested structures, making it suitable for big data processing.
JSON files can be larger in size compared to Parquet files due to their text-based nature.
Parquet is optimized for read-heavy operations, making it ideal for analytical workloads.
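A quick pandas sketch of writing the same data in both formats (to_parquet needs pyarrow or fastparquet installed):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

df.to_json("data.json", orient="records")  # text-based, human-readable
df.to_parquet("data.parquet")              # binary, columnar, compressed

print(pd.read_parquet("data.parquet"))
```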

Asked in TCS

Q. When should AWS Lambda and AWS Glue be used?
Lambda for event-driven processing, Glue for ETL jobs
Use Lambda for event-driven processing, such as real-time data processing or triggering actions based on events
Use Glue for ETL (Extract, Transform, Load) jobs, especially when dealing with large volumes of data or complex transformations
Lambda is serverless and scales automatically, while Glue provides managed ETL capabilities with built-in connectors to various data sources
Consider using Lambda for small, quick tasks and Glue for large-scale, long-running data processing jobs.

Asked in Fractal Analytics

Q. Write a SQL query using window functions to solve a data analysis problem.
Understanding SQL window functions for advanced data analysis and reporting.
Window functions perform calculations across a set of table rows related to the current row.
Common window functions include ROW_NUMBER(), RANK(), DENSE_RANK(), and SUM().
Example: SELECT employee_id, salary, RANK() OVER (ORDER BY salary DESC) AS salary_rank FROM employees;
Window functions can be used with PARTITION BY to segment data, e.g., SELECT department, employee_id, SUM(salary) OVER (PARTITION BY department) AS dept_total FROM employees;

Asked in Hexaware Technologies

Q. Write an AWS Lambda function from scratch.
AWS Lambda is a serverless compute service that runs code in response to events, automatically managing the compute resources.
Event-Driven: AWS Lambda functions are triggered by events such as changes in data, HTTP requests via API Gateway, or messages from SQS.
Scalability: Lambda automatically scales your application by running code in response to each trigger, handling thousands of requests simultaneously.
Pay-as-You-Go: You only pay for the compute time you consume, with no charge when your code is not running.
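A minimal handler to illustrate the standard entry point (the greeting logic is just an example):

```python
import json

def lambda_handler(event, context):
    # 'event' carries the trigger payload; 'context' holds runtime metadata
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }
```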

Asked in Cognizant

Q. Types of compression
Types of compression include lossless and lossy compression.
Lossless compression reduces file size without losing any data.
Lossy compression reduces file size by removing some data, resulting in lower quality.
Examples of lossless compression include ZIP, GZIP, and PNG.
Examples of lossy compression include JPEG and MP3.
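A tiny demonstration of lossless compression with Python's gzip module:

```python
import gzip

raw = b"repetitive data " * 100
compressed = gzip.compress(raw)      # lossless: fully reversible
restored = gzip.decompress(compressed)

print(len(raw), len(compressed))     # the compressed form is much smaller
assert restored == raw               # no data was lost
```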

Asked in Deloitte

Q. Explain partitioning and coalesce.
Partitioning and coalesce are techniques used in data processing to optimize performance and manage data distribution.
Partitioning involves dividing data into smaller chunks based on a specified column or criteria, which helps in parallel processing and query optimization.
Coalesce is used to reduce the number of partitions by combining smaller partitions into larger ones, which can improve query performance and reduce overhead.
Partitioning is commonly used in distributed systems like Spark to enable parallel processing across nodes, as in the sketch below.
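In PySpark terms, a short sketch of the two operations:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)

df = df.repartition(100)  # full shuffle: redistribute into 100 partitions
df = df.coalesce(10)      # merge down to 10 partitions without a full shuffle
print(df.rdd.getNumPartitions())  # 10
```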

Asked in Photon Interactive

Q. Name any two data lineage tools.
Two data lineage tools are Apache Atlas and Informatica Enterprise Data Catalog.
Apache Atlas is an open source tool for metadata management and governance in Hadoop ecosystems.
Informatica Enterprise Data Catalog provides a comprehensive data discovery and metadata management solution.

Asked in Broadridge Financial Solutions

Q. How do you flatten a JSON object?
Flattening JSON involves transforming nested JSON structures into a simpler, one-dimensional format.
Use libraries like `json_normalize` in Python to flatten JSON. Example: `pd.json_normalize(data)`.
Flattening can be done by concatenating keys. For example, `{ 'a': { 'b': 1 } }` becomes `{ 'a_b': 1 }`.
Consider using AWS Glue for ETL processes to flatten JSON data in data lakes.
In JavaScript, you can use recursion to flatten JSON objects. Example: `function flatten(obj, prefix)` that walks nested keys and concatenates them.
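A Python sketch of the same key-concatenation approach, alongside the pandas helper mentioned above:

```python
import pandas as pd

nested = {"a": {"b": 1, "c": {"d": 2}}, "e": 3}

def flatten(obj, prefix=""):
    # Recursively concatenate keys: {'a': {'b': 1}} -> {'a_b': 1}
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name + "_"))
        else:
            flat[name] = value
    return flat

print(flatten(nested))                     # {'a_b': 1, 'a_c_d': 2, 'e': 3}
print(pd.json_normalize(nested, sep="_"))  # same idea via pandas
```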