PwC
10+ ShopSe Interview Questions and Answers
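Q1. What is a data flow? What is the difference between an ADF pipeline and a data flow?
A mapping data flow is a visually designed data transformation that runs on Spark clusters managed by ADF. An ADF pipeline is a logical grouping of activities that orchestrates data movement and transformation.
Data flow is a drag-and-drop interface for designing data transformation logic without writing code
ADF pipeline is a set of activities that orchestrates data movement and transformation
A data flow handles row-level transformations (joins, aggregations, derived columns), while a pipeline handles control flow (sequencing, looping, branching, scheduling)
A data flow is not a standalone entity at run time: it is executed from a pipeline through a Data Flow activity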
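Q2. What is ADF? Build a dynamic pipeline, Spark architecture, SQL, data flow
"AFD" in the question as posted is most likely a typo for ADF (Azure Data Factory), Microsoft's cloud service for orchestrating data movement and transformation. The rest of the question is a list of topics rather than a single question, so ask the interviewer for more context.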
Q3. What challenges did you face when migrating data from one system to another?
Challenges faced during data migration include data loss, compatibility issues, downtime, and security concerns.
Data loss: Ensuring all data is successfully transferred without any loss or corruption.
Compatibility issues: Ensuring data formats, structures, and systems are compatible for seamless migration.
Downtime: Minimizing downtime during migration to avoid disruption to operations.
Security concerns: Ensuring data security and privacy are maintained throughout the migration...
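Q4. Is nested for each possible in ADF?
Not directly: ADF does not allow a ForEach activity to be placed inside another ForEach activity.
The usual workaround is to put the inner ForEach in a child pipeline and call it from the outer ForEach with an Execute Pipeline activity.
ForEach is a pipeline control-flow activity; it is not available inside mapping data flows.
Example: For each customer, call a child pipeline that loops over that customer's orders.
Each child call is a separate pipeline run, so flatten the input array first where that is practical.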
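ADF does not accept a ForEach placed directly inside another ForEach, so two-level iteration is commonly built as a parent pipeline whose ForEach calls a child pipeline through an Execute Pipeline activity. The Python sketch below only builds the JSON shape of such a pair. The pipeline names, the `customers` and `orders` parameters, and the placeholder Wait activity are all hypothetical, and the definitions should be checked against the ADF authoring UI before use.

```python
import json


def build_child_pipeline() -> dict:
    """Child pipeline (hypothetical name): loops over the orders it receives."""
    return {
        "name": "ProcessOrdersChild",
        "properties": {
            "parameters": {"orders": {"type": "Array"}},
            "activities": [
                {
                    "name": "ForEachOrder",
                    "type": "ForEach",
                    "typeProperties": {
                        "items": {
                            "value": "@pipeline().parameters.orders",
                            "type": "Expression",
                        },
                        # Placeholder for the real per-order work.
                        "activities": [
                            {
                                "name": "PlaceholderWait",
                                "type": "Wait",
                                "typeProperties": {"waitTimeInSeconds": 1},
                            }
                        ],
                    },
                }
            ],
        },
    }


def build_parent_pipeline() -> dict:
    """Parent pipeline (hypothetical name): outer loop that calls the child."""
    return {
        "name": "ProcessCustomersParent",
        "properties": {
            "parameters": {"customers": {"type": "Array"}},
            "activities": [
                {
                    "name": "ForEachCustomer",
                    "type": "ForEach",
                    "typeProperties": {
                        "items": {
                            "value": "@pipeline().parameters.customers",
                            "type": "Expression",
                        },
                        # The inner loop lives in the child pipeline, not here.
                        "activities": [
                            {
                                "name": "RunOrdersForCustomer",
                                "type": "ExecutePipeline",
                                "typeProperties": {
                                    "pipeline": {
                                        "referenceName": "ProcessOrdersChild",
                                        "type": "PipelineReference",
                                    },
                                    "parameters": {
                                        "orders": {
                                            "value": "@item().orders",
                                            "type": "Expression",
                                        }
                                    },
                                    "waitOnCompletion": True,
                                },
                            }
                        ],
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_parent_pipeline(), indent=2))
    print(json.dumps(build_child_pipeline(), indent=2))
```

A third level (items within an order) would follow the same pattern with another child pipeline.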
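Q5. Difference between coalesce and repartition
In Spark, coalesce reduces the number of partitions without a full shuffle, while repartition can increase or decrease the number of partitions and always performs a full shuffle.
The question as posted says "reparation", which is almost certainly a typo for repartition; there is no standard SQL or Spark function called reparation.
Coalesce merges existing partitions, so it is cheaper but can leave partitions unevenly sized.
Repartition redistributes rows across the cluster, so it is more expensive but produces evenly sized partitions.
In SQL, COALESCE is an unrelated function that returns the first non-null value among its arguments.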
Q6. How to delete duplicates from a database
To delete duplicates from a database, you can use SQL queries to identify and remove duplicate records.
Use the DISTINCT keyword in a SELECT query to retrieve unique records (this only filters the query result; it does not delete anything)
Identify duplicate records using GROUP BY and HAVING COUNT(*) > 1
Delete duplicate records using a DELETE statement with a subquery that keeps only one instance of each group, for example the row with the lowest ID
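The DELETE-with-subquery approach can be sketched with Python's built-in `sqlite3` module. The `customers` table, its columns, and the sample rows are made up for illustration; the example assumes a row counts as a duplicate when `name` and `email` both match, and it keeps the row with the lowest `id` in each group.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory table (hypothetical schema) containing duplicates."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
    )
    conn.executemany(
        "INSERT INTO customers (name, email) VALUES (?, ?)",
        [
            ("Asha", "asha@example.com"),
            ("Ravi", "ravi@example.com"),
            ("Asha", "asha@example.com"),   # duplicate of id 1
            ("Asha", "asha@example.com"),   # duplicate of id 1
            ("Meera", "meera@example.com"),
        ],
    )
    conn.commit()
    return conn


def find_duplicates(conn: sqlite3.Connection) -> list:
    """Return (name, email, count) for every group that occurs more than once."""
    return conn.execute(
        """
        SELECT name, email, COUNT(*)
        FROM customers
        GROUP BY name, email
        HAVING COUNT(*) > 1
        """
    ).fetchall()


def delete_duplicates(conn: sqlite3.Connection) -> int:
    """Keep the lowest id per (name, email); return the number of rows deleted."""
    cur = conn.execute(
        """
        DELETE FROM customers
        WHERE id NOT IN (
            SELECT MIN(id) FROM customers GROUP BY name, email
        )
        """
    )
    conn.commit()
    return cur.rowcount


if __name__ == "__main__":
    conn = build_demo_db()
    print(find_duplicates(conn))
    print(delete_duplicates(conn), "duplicate rows deleted")
```

On large tables it is safer to run the GROUP BY / HAVING query first and review the groups before deleting.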
Q7. Repartition vs coalesce, DAG vs lineage
Repartition and coalesce both change the number of partitions; the DAG and the lineage both describe how a result is derived, at the job level and at the dataset level respectively.
Repartition: increases or decreases the number of partitions in a DataFrame or RDD
Coalesce: decreases the number of partitions in a DataFrame or RDD
DAG (Directed Acyclic Graph): a graph that represents the flow of data and operations in a Spark job; the scheduler splits it into stages at shuffle boundaries
Lineage: the history of transformations that were applied to an RDD or DataFrame; Spark uses it to recompute lost partitions
Repartition is a shuffle operation and can be expensive, while coalesce is a...
Q8. Write code to print the reverse of a string.
Walk the string from its last character to its first, build a new string, and print it.
Use a loop to iterate through the characters of the string in reverse order
Append each character to a new string to build the reversed string
Return the reversed string
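The steps above can be sketched in Python as follows. The function name `reverse_string` is just an illustrative choice; characters are collected in a list and joined once at the end, which avoids repeated string concatenation.

```python
def reverse_string(text: str) -> str:
    """Return text with its characters in reverse order."""
    reversed_chars = []
    # Walk the indices from the last character down to the first.
    for index in range(len(text) - 1, -1, -1):
        reversed_chars.append(text[index])
    return "".join(reversed_chars)


if __name__ == "__main__":
    print(reverse_string("PySpark"))  # prints "krapSyP"
```

In everyday Python the slice `text[::-1]` gives the same result in one expression; the explicit loop is shown because interviewers often ask for it.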
Q9. DataFrames in PySpark
DataFrames in PySpark are distributed collections of data organized into named columns.
DataFrames are similar to tables in a relational database.
They can be created from various data sources like CSV, JSON, Parquet, etc.
DataFrames support SQL queries and transformations using PySpark functions; transformations are evaluated lazily and run only when an action is called.
Q10. Ready to travel on site
Yes, I am ready to travel on site for data engineering projects.
I am willing to travel for client meetings, project kick-offs, and on-site troubleshooting.
I understand the importance of face-to-face interactions in project delivery.
I have previous experience traveling for work, such as attending conferences or training sessions.
I am flexible with my schedule and can accommodate last-minute travel if needed.
Q11. Repartition vs coalesce
Repartition is used to increase or decrease the number of partitions in a DataFrame, while coalesce is used to decrease the number of partitions without a full shuffle.
Repartition involves shuffling data across the network, which can be expensive in terms of performance and resources.
Coalesce is a more efficient operation as it minimizes data movement by merging existing partitions instead of redistributing every row.
Example: repartition(10) will create 10 partitions in a DataFrame, while coalesce...
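The difference in data movement can be illustrated with a plain-Python toy model. This is not Spark code and not Spark's actual algorithm: partitions are modelled as lists, `toy_repartition` reassigns every row (standing in for a full shuffle), and `toy_coalesce` only merges whole existing partitions. Both function names are hypothetical.

```python
from typing import List

Partition = List[int]


def toy_repartition(partitions: List[Partition], n: int) -> List[Partition]:
    """Model a full shuffle: every row is reassigned round-robin to n partitions."""
    rows = [row for part in partitions for row in part]
    result: List[Partition] = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        result[i % n].append(row)
    return result


def toy_coalesce(partitions: List[Partition], n: int) -> List[Partition]:
    """Model coalesce: merge whole partitions into n groups; never add partitions."""
    current = len(partitions)
    if n >= current:
        # Asking for more partitions than exist leaves the count unchanged.
        return [list(part) for part in partitions]
    result: List[Partition] = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Neighbouring partitions land in the same group; rows are not split up.
        result[i * n // current].extend(part)
    return result


if __name__ == "__main__":
    skewed = [[1, 2, 3, 4, 5, 6], [7], [8], [9, 10]]
    print(toy_repartition(skewed, 2))
    print(toy_coalesce(skewed, 2))
    print(len(toy_coalesce(skewed, 8)), len(toy_repartition(skewed, 8)))
```

In the model, repartitioning the skewed input gives evenly sized partitions at the cost of moving every row, while coalescing keeps the original groupings and so can stay unbalanced.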
Q12. Copy Activity in ADF
Copy Activity in ADF copies data from a source data store to a sink data store.
Copy Activity is a built-in activity in Azure Data Factory (ADF)
It can be used to move data between supported data stores such as Azure Blob Storage, SQL Database, etc.
It copies data and can convert file formats and map columns along the way; heavier transformations belong in data flows or other compute activities
You can define source and sink datasets, column mapping, and settings in Copy Activity
Example: Copying data from an on-premises SQL Server to Azure Data Lake Storage using...