Capgemini
APECO Infrastructure Interview Questions and Answers
Q1. How do you read a Parquet file, call a notebook from ADF, handle the Azure DevOps CI/CD process, and use system variables in ADF?
Answers to common Azure Data Engineer interview topics
To read a Parquet file, use the PyArrow or pandas library (pd.read_parquet), or spark.read.parquet in PySpark
To call a notebook from ADF, use the Databricks Notebook activity in an ADF pipeline
For the Azure DevOps CI/CD process, use Azure Pipelines (build and release stages)
System variables in ADF can be accessed through expressions such as @pipeline().RunId or @pipeline().TriggerTime
Q2. What is the difference between persist and cache in PySpark?
persist() and cache() both mark a DataFrame or RDD for reuse, but cache() is simply persist() called with the default storage level.
persist() lets you specify a storage level (MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, etc.), while cache() takes no arguments
The default level is MEMORY_ONLY for RDDs and MEMORY_AND_DISK for DataFrames
Use persist() when you need control over where the data is stored; use cache() for the common default case
Q3. How did you migrate Oracle data into Azure?
I migrated Oracle data into Azure using Azure Data Factory and Azure Database Migration Service.
Used Azure Data Factory to create pipelines for data migration
Utilized Azure Database Migration Service for schema and data migration
Ensured data consistency and integrity during the migration process
Q4. Explain SCD Type 1 and SCD Type 2 in ADF with an example
SCD Type 1 and SCD Type 2 are Slowly Changing Dimension patterns for handling changes in dimension data, typically implemented in ADF mapping data flows.
SCD Type 1 overwrites the existing row with the new value, keeping no history (e.g. a customer's city changes from Pune to Mumbai and the row is simply updated).
SCD Type 2 preserves history by expiring the current row (setting an end date or is_current flag) and inserting a new row with the changed value.
In ADF, both patterns are built in mapping data flows using lookup, derived column, and Alter Row (insert/update) transformations against the sink table.
Q5. How do you read a CSV file in PySpark?
Use SparkSession's DataFrameReader to load a CSV file into a Spark DataFrame.
Create a SparkSession and call spark.read.csv (or spark.read.format("csv").load)
Specify the file path when reading the CSV file
Use options such as header and inferSchema so the file is parsed correctly
Q6. Remove duplicates
Use DISTINCT keyword in SQL to remove duplicates from a dataset.
Use SELECT DISTINCT column_name FROM table_name to retrieve unique values from a specific column.
Use SELECT DISTINCT * FROM table_name to retrieve unique rows from the entire table.
Use GROUP BY with COUNT(*) and HAVING to identify duplicate rows before deciding how to remove them.
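The DISTINCT and GROUP BY patterns above can be demonstrated on an in-memory SQLite table; the table name and data are made up for illustration:

```python
# SQL deduplication patterns on an in-memory SQLite database
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer TEXT, city TEXT)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("A", "Pune"), ("A", "Pune"), ("B", "Delhi")])

# SELECT DISTINCT * removes duplicate rows
unique_rows = con.execute("SELECT DISTINCT * FROM orders").fetchall()

# GROUP BY ... HAVING COUNT(*) > 1 identifies which rows are duplicated
duplicates = con.execute(
    "SELECT customer, city, COUNT(*) FROM orders "
    "GROUP BY customer, city HAVING COUNT(*) > 1").fetchall()

print(sorted(unique_rows))  # [('A', 'Pune'), ('B', 'Delhi')]
print(duplicates)           # [('A', 'Pune', 2)]
```

In PySpark the equivalent DataFrame call is `df.dropDuplicates()`.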