Azure Data Engineer
100+ Azure Data Engineer Interview Questions and Answers

Asked in TCS

Q. How can we load multiple (50) tables at a time using ADF?
You can load multiple tables at a time using Azure Data Factory by creating a single pipeline with multiple copy activities.
Create a pipeline in Azure Data Factory
Add multiple copy activities to the pipeline, each copy activity for loading data from one table
Configure each copy activity to load data from a different table
Run the pipeline to load data from all tables simultaneously

Asked in Tech Mahindra

Q. If both ADF and Databricks can achieve similar functionalities like data transformation, fetching data, and loading the dimension layer, why use Databricks? What is the rationale behind having both?
Databricks enhances data processing with advanced analytics, collaboration, and scalability beyond ADF's capabilities.
Databricks provides a collaborative environment for data scientists and engineers to work together using notebooks.
It supports advanced analytics and machine learning workflows, which ADF lacks natively.
Databricks can handle large-scale data processing with Apache Spark, making it more efficient for big data tasks.
With Databricks, you can leverage Delta Lake f...read more
Azure Data Engineer Interview Questions and Answers for Freshers

Asked in KPMG India

Q. What is the difference between RDD, DataFrame, and Dataset? How and what have you used in Databricks for data analysis?
RDD, Dataframe and Dataset are data structures in Spark. RDD is a low-level structure, Dataframe is tabular and Dataset is a combination of both.
RDD stands for Resilient Distributed Datasets and is a low-level structure in Spark that is immutable and fault-tolerant.
Dataframe is a tabular structure with named columns and is similar to a table in a relational database.
Dataset is a combination of RDD and Dataframe and provides type-safety and object-oriented programming features...read more
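For illustration, here is a minimal PySpark sketch contrasting an RDD with a DataFrame. Note that the typed Dataset API exists only in Scala/Java, so in PySpark you work with RDDs and DataFrames; the sample data below is made up.

    from pyspark.sql import SparkSession

    # In a Databricks notebook, spark already exists; getOrCreate() reuses it.
    spark = SparkSession.builder.getOrCreate()

    # RDD: low-level, untyped, fault-tolerant collection of Python objects.
    rdd = spark.sparkContext.parallelize([(1, "Alice"), (2, "Bob")])
    print(rdd.map(lambda row: row[1]).collect())   # ['Alice', 'Bob']

    # DataFrame: tabular structure with named columns, optimized by Catalyst.
    df = spark.createDataFrame(rdd, ["id", "name"])
    df.filter(df.id > 1).show()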

Asked in Techigai

Q. What is incremental load and other types of loads? How do you implement incremental load in your ADF pipeline?
Incremental load is a process of updating only the new or changed data in a target system. Other types of loads include full load and delta load.
Incremental load reduces the time and resources required to update the target system.
Full load loads all the data from the source system to the target system.
Delta load loads only the changed data from the source system to the target system.
In ADF pipeline, incremental load can be implemented using watermarking or change data capture...read more
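In ADF itself the watermark pattern is usually a Lookup of the stored watermark, a Copy activity whose source query filters on it, and a final activity that updates the watermark. As a rough sketch, the same pattern is shown below in PySpark with Delta tables; the table and column names (source_orders, target_orders, stg_watermark, last_modified) are assumptions for illustration.

    # Assumes spark is the active SparkSession (predefined in Databricks notebooks)
    # and that the tables involved are Delta tables.

    # Read the watermark stored by the previous run.
    last_wm = spark.sql(
        "SELECT watermark_value FROM stg_watermark WHERE table_name = 'source_orders'"
    ).collect()[0][0]

    # Pull only the rows that changed since the last watermark (the incremental slice).
    increment = spark.sql(f"SELECT * FROM source_orders WHERE last_modified > '{last_wm}'")

    # Append the slice to the target, then advance the watermark.
    increment.write.mode("append").saveAsTable("target_orders")
    new_wm = increment.agg({"last_modified": "max"}).collect()[0][0]
    if new_wm is not None:
        spark.sql(
            f"UPDATE stg_watermark SET watermark_value = '{new_wm}' "
            "WHERE table_name = 'source_orders'"
        )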

Asked in TCS

Q. Show me the details of newly joined employees based on two tables: the employee table and the salary table, where employee ID is the foreign key.
Retrieve newly joined employee details by joining employee and salary tables using Emp ID as the foreign key.
Use SQL JOIN to combine employee and salary tables on Emp ID.
Example SQL query: SELECT e.*, s.salary FROM employee e JOIN salary s ON e.EmpID = s.EmpID WHERE e.joining_date > '2023-01-01';
Ensure to filter based on the joining date to get newly joined employees.
Consider using LEFT JOIN if you want to include employees without salary records.

Asked in Tech Mahindra

Q. How do you design an effective ADF pipeline, and what metrics and considerations should you keep in mind while designing?
Designing an effective ADF pipeline involves considering various metrics and factors.
Understand the data sources and destinations
Identify the dependencies between activities
Optimize data movement and processing for performance
Monitor and track pipeline execution for troubleshooting
Consider security and compliance requirements
Use parameterization and dynamic content for flexibility
Implement error handling and retries for robustness

Asked in Tech Mahindra

Q. Let's say you have a customers table with customerID and customer name, and an orders table with OrderId and CustomerID. Write a query to find the customer name(s) of the customer(s) who placed the maximum number of orders.
Query to find customer names with the maximum orders from Customers and Orders tables.
Use JOIN to combine Customers and Orders tables on CustomerID.
Group by CustomerID and count orders to find the maximum.
Use a subquery to filter customers with the maximum order count.
Example SQL: SELECT c.customerName FROM Customers c JOIN Orders o ON c.customerID = o.CustomerID GROUP BY c.customerID HAVING COUNT(o.OrderId) = (SELECT MAX(orderCount) FROM (SELECT COUNT(OrderId) AS orderCount ...read more
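For illustration only, one complete query following this pattern is sketched below, written as Spark SQL with the column names given in the question; a CTE makes the maximum-count comparison explicit and handles ties.

    # Assumes spark is the active SparkSession (predefined in Databricks notebooks).
    spark.sql("""
        WITH counts AS (
            SELECT c.customerID, c.customerName, COUNT(o.OrderId) AS order_count
            FROM Customers c
            JOIN Orders o ON o.CustomerID = c.customerID
            GROUP BY c.customerID, c.customerName
        )
        SELECT customerName
        FROM counts
        WHERE order_count = (SELECT MAX(order_count) FROM counts)
    """).show()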

Asked in TCS

Q. What is the get metadata activity and what parameters do we have to pass?
Get metadata activity is used to retrieve metadata of a specified data store or dataset in Azure Data Factory.
Get metadata activity is used in Azure Data Factory to retrieve metadata of a specified data store or dataset.
Parameters to pass include the dataset (with its linked service) and a field list of the metadata to retrieve, such as childItems, exists, lastModified, or columnCount.
The output of the activity includes information like schema, size, last modified timestamp, etc.
Example: Get metadata of a SQL Server table using a linked service to the database, or of a folder in Blob/ADLS; a downstream ForEach can then reference an expression such as @activity('Get Metadata1').output.childItems (the activity name is illustrative).

Asked in Tech Mahindra

Q. Suppose you have table 1 with values 1, 2, 3, 5, null, null, 0 and table 2 has null, 2, 4, 7, 3, 5. What would be the output after an inner join?
The output after inner join of table 1 and table 2 will be 2,3,5.
Inner join only includes rows that have matching values in both tables.
Values 2, 3, and 5 are present in both tables, so they will be included in the output.
Null values are not considered as matching values in inner join.

Asked in Tech Mahindra

Q. How would you slice and dice a very large data volume to boost performance?
Optimize data processing by partitioning, indexing, and using efficient storage formats.
Partitioning: Divide large datasets into smaller, manageable chunks. For example, partitioning a sales dataset by year.
Indexing: Create indexes on frequently queried columns to speed up data retrieval. For instance, indexing customer IDs in a transaction table.
Data Compression: Use compressed formats like Parquet or ORC to reduce storage size and improve I/O performance.
Caching: Implement ...read more
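A minimal PySpark sketch of the partitioning, compression, and caching ideas above; the paths and column names are placeholders.

    # Assumes spark is the active SparkSession (predefined in Databricks notebooks).
    sales = spark.read.parquet("abfss://raw@mystorageacct.dfs.core.windows.net/sales/")

    # Partition the output by year so downstream queries prune irrelevant files.
    (sales.write
          .mode("overwrite")
          .partitionBy("year")
          .parquet("abfss://curated@mystorageacct.dfs.core.windows.net/sales_by_year/"))

    # Cache a frequently reused subset in memory to avoid repeated scans.
    recent = sales.filter(sales.year >= 2023).cache()
    recent.count()   # action that materializes the cache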

Asked in KPMG India

Q. What are key components in ADF? What all you have used in your pipeline?
ADF key components include pipelines, activities, datasets, triggers, and linked services.
Pipelines - logical grouping of activities
Activities - individual tasks within a pipeline
Datasets - data sources and destinations
Triggers - event-based or time-based execution of pipelines
Linked Services - connections to external data sources
Examples: Copy Data activity, Lookup activity, Blob Storage dataset

Asked in TCS

Q. How can we monitor the child pipeline in the master pipeline?
You can monitor the child pipeline in the master pipeline by using Azure Monitor or Azure Data Factory monitoring tools.
Use Azure Monitor to track the performance and health of the child pipeline within the master pipeline.
Leverage Azure Data Factory monitoring tools to view detailed logs and metrics for the child pipeline execution.
Set up alerts and notifications to be informed of any issues or failures in the child pipeline.

Asked in Techigai

Q. What are the error handling mechanisms in ADF pipelines?
ADF pipelines have several error handling mechanisms to ensure data integrity and pipeline reliability.
ADF provides built-in retry mechanisms for transient errors such as network connectivity issues or service outages.
ADF also supports custom error handling through the use of conditional activities and error outputs.
Error outputs can be used to redirect failed data to a separate pipeline or storage location for further analysis.
ADF also provides logging and monitoring capabil...read more

Asked in KPMG India

Q. Do you create any encryption keys in Databricks? What about cluster size in Databricks?
Yes, encryption keys can be created in Databricks. Cluster size can be adjusted based on workload.
Encryption keys can be created using Azure Key Vault or Databricks secrets
Cluster size can be adjusted manually or using autoscaling based on workload
Encryption at rest can also be enabled for data stored in Databricks
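A small sketch of reading a secret in a Databricks notebook, where dbutils is predefined; the scope and key names (kv-scope, adls-key) and the storage account name are assumptions.

    # The secret scope is typically backed by Azure Key Vault.
    storage_key = dbutils.secrets.get(scope="kv-scope", key="adls-key")

    # The value is redacted in notebook output and can be used to configure storage access.
    spark.conf.set("fs.azure.account.key.mystorageacct.dfs.core.windows.net", storage_key)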

Asked in Newt Global

Q. An on-premise Oracle server has a daily incremental data increase of 10 GB. How would you migrate this data to the cloud using Azure?
Use Azure Data Factory to move data from on-premise Oracle server to Azure cloud.
Install a self-hosted integration runtime and create a linked service to connect to the on-premise Oracle server
Create a linked service to connect to the Azure cloud storage
Create a pipeline with a copy activity to move data from on-premise to cloud
Schedule the pipeline to run daily
Monitor the pipeline for any errors or issues

Asked in Capgemini

Q. How do you read a Parquet file? How do you call a notebook from ADF? Explain the Azure DevOps CI/CD process and system variables in ADF.
These questions cover reading Parquet files, calling notebooks from ADF, CI/CD with Azure DevOps, and ADF system variables.
To read a Parquet file, use the PyArrow or pandas library (or spark.read.parquet in Databricks)
To call a notebook from ADF, use the Databricks Notebook activity in an ADF pipeline
For Azure DevOps CI/CD process, use Azure Pipelines
System variables in ADF can be accessed using expressions like @pipeline().RunId
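A short sketch of reading Parquet both ways; the file paths are placeholders, and the PySpark line assumes a Databricks notebook where spark is predefined.

    import pandas as pd

    # pandas delegates to PyArrow (or fastparquet) under the hood.
    pdf = pd.read_parquet("sales.parquet")

    # PySpark read from ADLS over the ABFS driver.
    sdf = spark.read.parquet("abfss://raw@mystorageacct.dfs.core.windows.net/sales/")
    sdf.printSchema()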

Asked in Deloitte

Q. Explain the Copy activity in ADF, slowly changing dimensions, and data warehousing.
Copy activity in ADF is used to move data from source to destination.
Copy activity supports various sources and destinations such as Azure Blob Storage, Azure SQL Database, etc.
It can be used for both one-time and scheduled data movement.
It supports schema and column mapping between source and destination.
Slowly changing dimensions are typically implemented with Mapping Data Flows or stored procedures, with the Copy activity used to land the source data.
Copy activity is commonly used in data warehousing scenarios.

Asked in TCS

Q. Do you know Databricks? How long have you been working with it?
Yes, I am familiar with Databricks and have been working on it for the past 2 years.
I have been using Databricks for data engineering tasks such as data processing, data transformation, and data visualization.
I have experience in building and optimizing data pipelines using Databricks.
I have worked on collaborative projects with team members using Databricks notebooks.
I have utilized Databricks for big data processing and analysis, leveraging its scalability and performance c...read more

Asked in TCS

Q. What is the difference between Blob storage and Azure Data Lake Storage (ADLS)?
Blob is a storage service for unstructured data, while ADLS is a distributed file system for big data analytics.
Blob is a general-purpose object storage service for unstructured data, while ADLS is optimized for big data analytics workloads.
Blob storage is suitable for storing large amounts of data, such as images, videos, and logs, while ADLS is designed for processing large datasets in parallel.
ADLS offers features like hierarchical namespace, POSIX-compliant file system se...read more

Asked in Tech Mahindra

Q. How would you reconstruct a table while preserving historical data, referring to Slowly Changing Dimensions (SCD)?
Use Slowly Changing Dimensions (SCD) to preserve historical data while reconstructing a table.
Implement SCD Type 1 for overwriting old data without keeping history.
Use SCD Type 2 to create new records for changes, preserving history.
Example of SCD Type 2: If a customer's address changes, add a new record with the new address and mark the old record as inactive.
SCD Type 3 allows for limited history by adding new columns for previous values.
Choose the appropriate SCD type based...read more
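As a hedged sketch, an SCD Type 2 load on Delta tables might look like the following; the table and column names (dim_customer, stg_customer, is_current, end_date) and the column order of the INSERT are assumptions, and the same logic can be written as a T-SQL MERGE.

    # Assumes spark is the active SparkSession and both tables are Delta tables.

    # 1. Close out current rows whose tracked attribute has changed.
    spark.sql("""
        MERGE INTO dim_customer AS t
        USING stg_customer AS s
        ON t.customer_id = s.customer_id AND t.is_current = true
        WHEN MATCHED AND t.address <> s.address THEN
          UPDATE SET t.is_current = false, t.end_date = current_date()
    """)

    # 2. Insert a fresh current row for changed and brand-new customers.
    spark.sql("""
        INSERT INTO dim_customer
        SELECT s.customer_id, s.address, current_date() AS start_date,
               CAST(NULL AS DATE) AS end_date, true AS is_current
        FROM stg_customer s
        LEFT JOIN dim_customer t
          ON t.customer_id = s.customer_id AND t.is_current = true
        WHERE t.customer_id IS NULL
    """)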

Asked in Wipro

Q. How do you connect Databricks with a storage account?
To connect Databricks to a storage account, configure credentials (access key, SAS token, or a service principal) and access the data over the ABFS driver, or mount the container.
Create a storage account (ADLS Gen2 recommended) in the Azure portal
Store the access key or service principal secret in Azure Key Vault or a Databricks secret scope
In the notebook or cluster configuration, set the Spark configuration for the storage account, or mount the container with dbutils.fs.mount
Access the data using abfss:// (or wasbs:// for Blob) paths and run a test read to verify the connection, as in the sketch below
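A minimal sketch of the access-key approach in a Databricks notebook, where spark and dbutils are predefined; the storage account, container, and secret names are assumptions, and a service principal with OAuth is the more production-friendly option.

    account = "mystorageacct"

    # Pull the storage account key from a secret scope instead of hard-coding it.
    spark.conf.set(
        f"fs.azure.account.key.{account}.dfs.core.windows.net",
        dbutils.secrets.get(scope="kv-scope", key="adls-key"),
    )

    # Read directly over the ABFS driver to verify the connection.
    df = spark.read.csv(f"abfss://raw@{account}.dfs.core.windows.net/test/", header=True)
    df.show(5)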

Asked in Capgemini

Q. How would you create a pipeline for ADLS to SQL data movement?
Create a data pipeline using Azure Data Factory to move data from ADLS to SQL Database.
1. Set up Azure Data Factory (ADF) instance in your Azure portal.
2. Create linked services for both Azure Data Lake Storage (ADLS) and SQL Database.
3. Define datasets for the source (ADLS) and destination (SQL Database).
4. Create a pipeline in ADF and add a copy activity to move data.
5. Configure the copy activity to specify source and destination datasets.
6. Set up triggers for scheduling ...read more

Asked in Capgemini

Q. If data needs filtering, joining, and aggregation, how would you do it with ADF?
Use Azure Data Factory (ADF) to filter, join, and aggregate data through data flows and pipelines.
Utilize Data Flows in ADF for data transformation tasks.
Use the Filter transformation to remove unwanted rows based on conditions.
Employ the Join transformation to combine datasets on common keys, e.g., joining patient records with treatment data.
Implement Aggregate transformation to summarize data, such as calculating average treatment costs per patient.
Schedule ADF pipelines to...read more
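For comparison, the same filter-join-aggregate logic is sketched below in PySpark, for example in a Databricks notebook called from ADF; the patients and treatments tables and their columns are made up.

    # Assumes spark is the active SparkSession; table and column names are illustrative.
    patients = spark.table("patients")
    treatments = spark.table("treatments")

    result = (
        patients.filter(patients.is_active == True)                  # filter unwanted rows
                .join(treatments, on="patient_id", how="inner")      # join on a common key
                .groupBy("patient_id")                                # aggregate per patient
                .agg({"treatment_cost": "avg"})
    )
    result.show()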

Asked in PwC

Q. What types of triggers are available in ADF?
There are three types of triggers available in Azure Data Factory: Schedule, Tumbling Window, and Event.
Schedule trigger: Runs pipelines on a specified schedule.
Tumbling Window trigger: Runs pipelines at specified time intervals.
Event trigger: Runs pipelines in response to events like a file being added to a storage account.

Asked in Newt Global

Q. How do you design/implement database solutions in the cloud?
Designing/implementing database solutions in the cloud involves selecting appropriate cloud service, data modeling, security, and scalability.
Select the appropriate cloud service (e.g. Azure SQL Database, Cosmos DB, etc.) based on the requirements of the application
Design the data model to optimize for the cloud environment (e.g. denormalization, partitioning, etc.)
Implement security measures such as encryption, access control, and auditing
Ensure scalability by using features...read more

Asked in Insight Global Technologies

Q. Write a SQL query to fetch the Customers who have not done any transaction in the last 30 days but did before 30 days.
SQL query to fetch customers who have not transacted in last 30 days but did before
Use a subquery to filter customers who transacted before 30 days
Use NOT IN or NOT EXISTS to exclude customers who transacted in last 30 days
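One possible version of the query, written as Spark SQL; the Transactions table and column names are assumptions, and T-SQL would use DATEADD(day, -30, GETDATE()) instead of date_sub.

    # Assumes spark is the active SparkSession (predefined in Databricks notebooks).
    spark.sql("""
        SELECT DISTINCT c.customer_id, c.customer_name
        FROM Customers c
        JOIN Transactions t ON t.customer_id = c.customer_id
        WHERE t.transaction_date < date_sub(current_date(), 30)
          AND NOT EXISTS (
                SELECT 1 FROM Transactions t2
                WHERE t2.customer_id = c.customer_id
                  AND t2.transaction_date >= date_sub(current_date(), 30)
          )
    """).show()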

Asked in TCS

Q. How would you convince a client to migrate to the cloud?
Migrating to the cloud offers numerous benefits such as cost savings, scalability, and improved security.
Highlight the cost savings that can be achieved by migrating to the cloud, as clients can avoid upfront infrastructure costs and pay only for the resources they use.
Emphasize the scalability of cloud services, allowing clients to easily scale up or down based on their needs without the need for additional hardware investments.
Discuss the improved security measures provided...read more

Asked in PwC

Q. Write a SQL query to find the fourth highest salary of an employee from an employee table.
To find the fourth highest salary from an employee table using SQL, we can use various methods like subqueries or the DISTINCT clause.
Use the DISTINCT keyword to get unique salaries: SELECT DISTINCT salary FROM employees ORDER BY salary DESC LIMIT 1 OFFSET 3;
Alternatively, use a subquery: SELECT MAX(salary) FROM employees WHERE salary < (SELECT MAX(salary) FROM employees WHERE salary < (SELECT MAX(salary) FROM employees WHERE salary < (SELECT MAX(salary) FROM employees)));
Ano...read more

Asked in TCS

Q. How do you read files in a notebook? What configuration is needed to read data? Why did you not use an ADF trigger only? What is the Parquet format? Window functions vs GROUP BY. How do you read a CSV file and store it in Parquet? Dataset vs DataFrame, transformations, and Delta Lake.
Reading files in a notebook, configuring data access, using ADF triggers, the Parquet format, window functions vs GROUP BY, reading a CSV file and storing it as Parquet, Dataset vs DataFrame, transformations, and Delta Lake.
To read files in notebook, use libraries like pandas or pyspark
Configuration needed includes specifying file path, format, and any additional options
ADF trigger can be used for automated data processing, but may not be necessary for all scenarios
Parquet format is a columnar stora...read more
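A quick sketch of reading a CSV and writing it back as Parquet; the paths are placeholders.

    # Assumes spark is the active SparkSession (predefined in Databricks notebooks).
    df = (spark.read
               .option("header", True)
               .option("inferSchema", True)
               .csv("abfss://raw@mystorageacct.dfs.core.windows.net/input/orders.csv"))

    # Parquet is columnar and compressed, so downstream reads scan fewer bytes.
    df.write.mode("overwrite").parquet("abfss://curated@mystorageacct.dfs.core.windows.net/orders_parquet/")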

Asked in Accenture

Q. What is the difference between a scheduled trigger and a tumbling window trigger?
Both are time-based, but a tumbling window trigger fires for a series of fixed-size, non-overlapping windows and retains per-window state, while a schedule trigger simply fires on a wall-clock schedule.
Schedule trigger runs on a specified schedule, such as every hour or every day, and just starts the pipeline.
Tumbling window trigger fires at a fixed periodic interval from a start time, and each run is tied to a specific window whose windowStart and windowEnd can be passed as parameters.
Schedule triggers suit regular data processing tasks like nightly ETL jobs and can be attached to many pipelines.
Tumbling window triggers support back-fill of past windows, retries, concurrency limits, and dependencies, and have a one-to-one relationship with a pipeline, which suits interval-based aggregations.
Scheduled trigger can be set to run at a specific time, while tumbling wind...read more