Data skewness is a measure of asymmetry in the distribution of data values.
A skewed distribution is one whose values are not spread symmetrically around the mean.
Positive skewness means the tail on the right side of the distribution is longer or fatter.
Negative skewness means the tail on the left side of the distribution is longer or fatter.
A perfectly symmetrical distribution has a skewness of 0, although a skewness of 0 does not by itself prove symmetry.
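The sign convention above can be checked with a small calculation. This is a minimal sketch, assuming the Fisher-Pearson moment coefficient (third central moment divided by the 1.5 power of the second); the function name and the sample lists are made up for illustration.

```python
from statistics import fmean

def sample_skewness(values):
    """Fisher-Pearson coefficient g1 = m3 / m2**1.5, using population moments."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    if m2 == 0:
        raise ValueError("skewness is undefined for constant data")
    return m3 / m2 ** 1.5

# Hypothetical samples: one long right tail, its mirror image, and a symmetric one.
right_tailed = [1, 2, 2, 3, 3, 3, 20]
left_tailed = [-x for x in right_tailed]
symmetric = [1, 2, 3, 4, 5]

print(sample_skewness(right_tailed))  # positive
print(sample_skewness(left_tailed))   # negative
print(sample_skewness(symmetric))     # zero
```

Libraries such as SciPy and pandas offer their own skewness functions, which may apply a bias correction and so return slightly different values.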
Transformations are operations that derive a new dataset from an existing one. There are two main types: narrow and wide.
In Spark, transformations are lazy: they are only recorded until an action triggers execution.
Narrow transformations are those where each input partition contributes to only one output partition, so no shuffle is needed, e.g., map, filter.
Wide transformations are those where each input parti...
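The difference can be modelled without a cluster. The sketch below is a toy in plain Python, not Spark's API: a dataset is a list of partitions, the narrow operation touches each partition on its own, and the wide operation follows the usual Spark definition in which records from one input partition may land in several output partitions. The function names and the integer keys are assumptions for illustration.

```python
from collections import defaultdict

# Toy dataset: a list of partitions, each holding (key, value) pairs.
partitions = [[(1, 10), (2, 20)], [(1, 30), (3, 40)], [(2, 50), (3, 60)]]

def narrow_map_values(parts, fn):
    """Narrow: each output partition is built from exactly one input partition."""
    return [[(k, fn(v)) for k, v in part] for part in parts]

def wide_group_by_key(parts, num_out):
    """Wide: records are routed by key, so one input partition feeds many outputs."""
    buckets = [defaultdict(list) for _ in range(num_out)]
    for part in parts:
        for k, v in part:
            buckets[k % num_out][k].append(v)  # integer keys assumed
    return [sorted(b.items()) for b in buckets]

doubled = narrow_map_values(partitions, lambda v: v * 2)
grouped = wide_group_by_key(partitions, num_out=2)

print(doubled)  # same partition layout as the input
print(grouped)  # values for each key gathered from several input partitions
```

In the grouped result every key's values come from more than one input partition, which is the data movement that a shuffle performs in Spark.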
OOM stands for Out Of Memory, and "driverhead memory" (most likely driver memory, or driver overhead memory) refers to the memory allocated to the driver in a Spark application.
OOM occurs when a process needs more memory than is available to it, leading to crashes or performance issues.
Driver memory in Spark is the memory allocated to the driver program, which coordinates tasks and manages the overall execution of the application; it is set with spark.driver.memory, and in cluster mode spark.driver.memoryOverhead adds non-heap memory on top of it.
Adjusting memory settings like exec...
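As an illustration of where such settings go, the sketch below assembles a spark-submit command line from a dictionary of memory properties. The property names are standard Spark configuration keys; the sizes, the helper name and the job.py path are hypothetical and would need tuning for a real workload.

```python
def spark_submit_args(app_path, conf):
    """Build a spark-submit argument list with one --conf flag per property."""
    args = ["spark-submit"]
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    args.append(app_path)
    return args

# Hypothetical sizes, chosen only to show the shape of the configuration.
memory_conf = {
    "spark.driver.memory": "4g",          # driver JVM heap
    "spark.driver.memoryOverhead": "1g",  # non-heap driver memory (cluster mode)
    "spark.driver.maxResultSize": "2g",   # cap on results collected to the driver
    "spark.executor.memory": "8g",        # executor JVM heap
}

command = spark_submit_args("job.py", memory_conf)
print(" ".join(command))
```

A driver-side OOM is often caused by collecting too much data to the driver, so reducing what is collected is usually worth trying before raising these limits.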
A Spark job goes through job submission, DAG creation, task scheduling, and task execution.
The user submits the application, and each action called through the SparkContext (or SparkSession) triggers a job.
Spark builds a Directed Acyclic Graph (DAG) of the job and splits it into stages and tasks at shuffle boundaries.
The Spark scheduler assigns tasks based on data locality and resource availability.
Tasks are executed by executors on the worker nodes of the cluster.
The output is collected and returned to the user, or written to storage.
Coalesce and repartition are Spark operations that control the number of partitions in a dataset.
Coalesce reduces the number of partitions by merging existing ones, which avoids a full shuffle and can improve performance.
Repartition increases or decreases the number of partitions by shuffling the data across the cluster.
Coalesce is preferred over repartition when reducing partiti...
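The trade-off can be seen in a toy model. The sketch below is plain Python, not Spark's implementation: coalesce is modelled as merging whole neighbouring partitions, and repartition as dealing every record round-robin into new partitions. The function names mirror the Spark methods only for readability.

```python
def coalesce(parts, n):
    """Merge whole neighbouring partitions into n groups; no record is re-routed alone."""
    if n >= len(parts):
        return [list(p) for p in parts]  # like Spark, this model cannot add partitions
    out = [[] for _ in range(n)]
    for i, part in enumerate(parts):
        out[i * n // len(parts)].extend(part)
    return out

def repartition(parts, n):
    """Full shuffle: deal every record round-robin across n new partitions."""
    out = [[] for _ in range(n)]
    records = [r for part in parts for r in part]
    for i, record in enumerate(records):
        out[i % n].append(record)
    return out

# Hypothetical, deliberately uneven input partitions.
parts = [[1, 2, 3, 4, 5, 6], [7], [8], [9, 10]]
merged = coalesce(parts, 2)
balanced = repartition(parts, 2)

print([len(p) for p in merged])    # uneven sizes, but no per-record movement
print([len(p) for p in balanced])  # even sizes, at the cost of moving every record
```

In this model coalesce keeps the imbalance of its inputs, while repartition evens out the sizes by touching every record.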
Use a SQL query with ORDER BY and LIMIT to find the third highest salary in a table.
Use an ORDER BY clause to sort salaries in descending order.
Use LIMIT 1 OFFSET 2 to skip the first two highest salaries; add DISTINCT so that tied salaries are counted once.
Example: SELECT DISTINCT salary FROM employees ORDER BY salary DESC LIMIT 1 OFFSET 2; (the LIMIT ... OFFSET syntax works in MySQL, PostgreSQL, and SQLite)
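The ORDER BY / LIMIT / OFFSET approach can be tried with Python's built-in sqlite3 module. This is a minimal sketch with made-up rows; it runs the query with and without DISTINCT to show how tied salaries change the answer.

```python
import sqlite3

salary_db = sqlite3.connect(":memory:")
salary_db.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
salary_db.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("a", 5000), ("b", 5000), ("c", 4000), ("d", 3000), ("e", 2000)],
)

# Third row of the sorted list: ties occupy separate rows.
third_row = salary_db.execute(
    "SELECT salary FROM employees ORDER BY salary DESC LIMIT 1 OFFSET 2"
).fetchone()[0]

# Third distinct salary: ties are collapsed before skipping two values.
third_distinct = salary_db.execute(
    "SELECT DISTINCT salary FROM employees ORDER BY salary DESC LIMIT 1 OFFSET 2"
).fetchone()[0]

print(third_row, third_distinct)
```

With two employees tied at the top, the two queries return different values, so it is worth asking the interviewer which meaning of "third highest" is intended.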
SCD Type II allows tracking historical changes in data by creating new records instead of overwriting existing ones.
Maintains a full history of changes to data over time.
Each change closes the current record by setting its end date and creates a new record with a new start date and an open end date.
Example: If a customer's address changes, a new record is created with the new address and a timestamp.
Useful in scenarios where historical accuracy is crucial, such as in financial or cus...
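The address-change example can be sketched in a few lines. This is an illustrative model only, assuming a dimension held as a list of dictionaries with hypothetical column names (customer_id, address, start_date, end_date) and an end_date of None marking the current version; it does not handle brand-new customers.

```python
from datetime import date

def apply_scd2_change(rows, customer_id, new_address, change_date):
    """Expire the current row for the customer and append a new current row."""
    result = []
    changed = False
    for row in rows:
        is_open = row["customer_id"] == customer_id and row["end_date"] is None
        if is_open and row["address"] != new_address:
            result.append({**row, "end_date": change_date})  # close the old version
            changed = True
        else:
            result.append(row)
    if changed:
        result.append({
            "customer_id": customer_id,
            "address": new_address,
            "start_date": change_date,
            "end_date": None,  # open-ended: this is now the current version
        })
    return result

dim_customer = [
    {"customer_id": 1, "address": "Pune", "start_date": date(2023, 1, 1), "end_date": None},
]
history = apply_scd2_change(dim_customer, 1, "Mumbai", date(2024, 6, 1))
for row in history:
    print(row)
```

Real implementations usually add a surrogate key and often an is_current flag, and perform the same two steps with a MERGE or an UPDATE followed by an INSERT.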
COALESCE returns the first non-null value among its arguments, while "reparation" (probably a misspelling of repartition, which is a Spark operation) is not a standard SQL function.
COALESCE is a standard SQL function, while "reparation" is not.
COALESCE evaluates its arguments from left to right and returns NULL only if all of them are NULL.
"Reparation" may refer to a custom function or process specific to a certain system or application.
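The behaviour of COALESCE is easy to confirm with Python's built-in sqlite3 module. The table and rows below are made up for illustration.

```python
import sqlite3

contacts_db = sqlite3.connect(":memory:")
contacts_db.execute("CREATE TABLE contacts (name TEXT, mobile TEXT, landline TEXT)")
contacts_db.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?)",
    [("asha", None, "022-1111"), ("ravi", "98-2222", "022-3333"), ("meena", None, None)],
)

# Prefer the mobile number, fall back to the landline, then to a fixed label.
best_numbers = contacts_db.execute(
    "SELECT name, COALESCE(mobile, landline, 'n/a') FROM contacts ORDER BY name"
).fetchall()

print(best_numbers)
```

Each row gets the first argument that is not NULL, so the literal at the end acts as a default.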
The SQL LAG function retrieves data from a previous row in a result set, useful for comparisons.
LAG function syntax: LAG(column_name, offset, default) OVER (PARTITION BY column ORDER BY column).
Example: SELECT date, sales, LAG(sales, 1) OVER (ORDER BY date) AS previous_sales FROM sales_data;
Useful for calculating differences, trends, or changes over time.
Can be used in financial analysis to compare current and pre...
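The LAG query above can be run as-is with Python's built-in sqlite3 module, assuming the bundled SQLite is version 3.25 or newer (the first release with window functions). The rows are invented, and the extra change column is an addition that shows the row-to-row difference.

```python
import sqlite3

sales_db = sqlite3.connect(":memory:")
sales_db.execute("CREATE TABLE sales_data (date TEXT, sales INTEGER)")
sales_db.executemany(
    "INSERT INTO sales_data VALUES (?, ?)",
    [("2024-01-01", 100), ("2024-01-02", 130), ("2024-01-03", 90)],
)

lag_rows = sales_db.execute(
    """
    SELECT date,
           sales,
           LAG(sales, 1) OVER (ORDER BY date) AS previous_sales,
           sales - LAG(sales, 1) OVER (ORDER BY date) AS change
    FROM sales_data
    ORDER BY date
    """
).fetchall()

for row in lag_rows:
    print(row)
```

The first row has no predecessor, so both previous_sales and change are NULL (None in Python) unless a default is passed as the third argument of LAG.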
Yes, I am ready to travel on site for data engineering projects.
I am willing to travel for client meetings, project kick-offs, and on-site troubleshooting.
I understand the importance of face-to-face interactions in project delivery.
I have previous experience traveling for work, such as attending conferences or training sessions.
I am flexible with my schedule and can accommodate last-minute travel if needed.
I appeared for an interview in Dec 2024.
Spark is a fast and general-purpose cluster computing system for big data processing.
Spark provides APIs in Java, Scala, Python, and R for distributed data processing.
It includes components like Spark SQL for SQL and structured data processing, Spark Streaming for real-time data processing, MLlib for machine learning, and GraphX for graph processing.
Spark can run on top of Hadoop, Mesos, Kubernetes, or in standalone mo...
I applied via LinkedIn and was interviewed in Sep 2024. There were 2 interview rounds.
I applied via Naukri.com and was interviewed in Dec 2024. There was 1 interview round.
repartition vs coalesce, persist vs cache
repartition is used to increase or decrease the number of partitions in a DataFrame with a full shuffle, while coalesce is used to decrease the number of partitions without a full shuffle.
persist stores the DataFrame at a storage level you choose (memory, disk, or both) for faster reuse, while cache is shorthand for persist with the default storage level (MEMORY_AND_DISK for DataFrames, MEMORY_ONLY for RDDs).
repartition example: df.repartition(10)
coalesce example: df.coalesce(5)
...
I applied via Naukri.com and was interviewed in Nov 2024. There was 1 interview round.
I applied via Naukri.com and was interviewed in Jul 2024. There were 4 interview rounds.
The aptitude round was okay, but the time given was short.
DataFrames in PySpark are distributed collections of data organized into named columns.
DataFrames are similar to tables in a relational database.
They can be created from various data sources like CSV, JSON, Parquet, etc.
DataFrames support SQL queries and transformations using PySpark functions.
Find the second highest salary from a list of employee salaries using Python.
Use a set to remove duplicates from the salary list.
Sort the unique salaries in descending order.
Access the second element in the sorted list to get the second highest salary.
Example: salaries = [3000, 2000, 3000, 4000]; unique_salaries = sorted(set(salaries), reverse=True); second_highest = unique_salaries[1].
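The one-liner above raises an IndexError when the list has fewer than two distinct salaries. A slightly more defensive sketch wraps the same set-and-sort steps in a function; returning None for that case is an assumption about the desired behaviour.

```python
def second_highest(salaries):
    """Return the second highest distinct salary, or None if there is no such value."""
    unique_salaries = sorted(set(salaries), reverse=True)
    if len(unique_salaries) < 2:
        return None
    return unique_salaries[1]

print(second_highest([3000, 2000, 3000, 4000]))
print(second_highest([5000, 5000]))
```

Sorting costs O(n log n); for very long lists a single pass that tracks the two largest distinct values would avoid the sort.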
I applied via AmbitionBox and was interviewed in Jan 2024. There was 1 interview round.
Write code to print the reverse of a string.
Use a loop to iterate through the characters of the string in reverse order
Append each character to a new string to build the reversed string
Return the reversed string
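The steps above can be sketched in Python as follows; the function name is arbitrary.

```python
def reverse_string(text):
    """Build the reversed string by walking the input from the last character to the first."""
    reversed_text = ""
    for index in range(len(text) - 1, -1, -1):
        reversed_text += text[index]
    return reversed_text

print(reverse_string("spark"))
```

In everyday Python the slice text[::-1] gives the same result in one step; the explicit loop is shown because interviewers often ask for it.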
I applied via Referral and was interviewed before Jul 2023. There were 2 interview rounds.
Understanding SQL's ROW_NUMBER() and RANK() functions for data ranking and ordering.
ROW_NUMBER() assigns a unique sequential integer to rows within a partition, starting at 1.
RANK() assigns a rank to each row within a partition, with gaps for ties (e.g., 1, 1, 3).
Example of ROW_NUMBER(): SELECT name, ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num FROM players;
Example of RANK(): SELECT name, RANK() OVER (ORDER BY score...
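Both functions can be compared side by side with Python's built-in sqlite3 module, assuming SQLite 3.25 or newer for window-function support. The players rows are invented, and the name tiebreaker in the ROW_NUMBER() ordering is added only to make the output deterministic.

```python
import sqlite3

players_db = sqlite3.connect(":memory:")
players_db.execute("CREATE TABLE players (name TEXT, score INTEGER)")
players_db.executemany(
    "INSERT INTO players VALUES (?, ?)",
    [("ana", 90), ("bo", 90), ("cy", 80)],
)

ranking = players_db.execute(
    """
    SELECT name,
           score,
           ROW_NUMBER() OVER (ORDER BY score DESC, name) AS row_num,
           RANK() OVER (ORDER BY score DESC) AS score_rank
    FROM players
    ORDER BY row_num
    """
).fetchall()

for row in ranking:
    print(row)
```

The two tied players get different row numbers but the same rank, and the next rank skips a value.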
To delete duplicates from a database, you can use SQL queries to identify and remove duplicate records.
Use the DISTINCT keyword in a SELECT query to retrieve unique records; this only affects the query result and does not delete anything.
Identify duplicate records using GROUP BY and HAVING COUNT(*) > 1.
Delete duplicate records using a DELETE statement with a subquery that keeps only one instance of each group.
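A minimal sketch of the identify-then-delete approach, using Python's built-in sqlite3 module. The customers table, its columns, and the rule of keeping the row with the smallest id are assumptions for illustration.

```python
import sqlite3

customers_db = sqlite3.connect(":memory:")
customers_db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")
customers_db.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "a@x.com"), (2, "b@x.com"), (3, "a@x.com"), (4, "a@x.com")],
)

# Step 1: find values that occur more than once.
duplicates = customers_db.execute(
    "SELECT email, COUNT(*) FROM customers GROUP BY email HAVING COUNT(*) > 1"
).fetchall()

# Step 2: delete every row except the lowest id in each group.
customers_db.execute(
    "DELETE FROM customers "
    "WHERE id NOT IN (SELECT MIN(id) FROM customers GROUP BY email)"
)

remaining = customers_db.execute("SELECT id, email FROM customers ORDER BY id").fetchall()
print(duplicates)
print(remaining)
```

Some databases, MySQL among them, reject a DELETE whose subquery reads the same table, so the subquery may need to be wrapped in a derived table or replaced with a ROW_NUMBER()-based approach there.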
Joining PwC offers opportunities for growth, innovation, and collaboration in a leading global professional services firm.
Reputation: PwC is recognized as one of the Big Four accounting firms, providing a strong foundation for career development and networking.
Diverse Projects: Working at PwC allows engagement in a variety of projects across industries, enhancing skills and experience in data engineering.
Innovation Foc...
I bring a blend of technical skills, problem-solving abilities, and strong communication to my role as a Data Engineer.
Strong analytical skills: I excel at analyzing complex datasets to derive actionable insights, as demonstrated in my previous project where I optimized data pipelines.
Proficiency in programming languages: I am skilled in Python and SQL, which I used to automate data processing tasks, reducing processin...
I applied via Job Portal
Repartition is used to increase or decrease the number of partitions in a DataFrame, while coalesce is used to decrease the number of partitions without shuffling data.
Repartition involves shuffling data across the network, which can be expensive in terms of performance and resources.
Coalesce is usually cheaper because it merges existing partitions and avoids a full shuffle, although the resulting partitions can be uneven in size.
Example: Repartition(10...
Copy Activity in ADF is used to move data between supported data stores.
Copy Activity is a built-in activity in Azure Data Factory (ADF).
It can be used to move data between supported data stores such as Azure Blob Storage, SQL Database, etc.
It is a data movement activity: it can convert formats and map columns while copying, but heavier transformations are normally handled by other activities such as data flows.
You can define source and sink datasets, mapping, and settings in Copy Activity.
Example: Copying data from...
The PwC Data Engineer interview process varies in length, but it typically takes less than 2 weeks to complete.
based on 16 interview experiences
Senior Associate | 19k salaries | ₹12.7 L/yr - ₹25 L/yr
Associate | 15.1k salaries | ₹7.9 L/yr - ₹14.5 L/yr
Manager | 7.6k salaries | ₹22.1 L/yr - ₹40 L/yr
Senior Consultant | 4.9k salaries | ₹15.9 L/yr - ₹26.3 L/yr
Associate 2 | 4.7k salaries | ₹7.5 L/yr - ₹14 L/yr