I applied via LinkedIn and was interviewed in Nov 2024. There was 1 interview round.
Implementing strategies to prevent data loss in ETL pipelines is crucial for data integrity and reliability.
Implement data validation checks at each stage of the ETL process to ensure data integrity.
Use logging mechanisms to track data flow and identify any discrepancies or failures.
Incorporate retry mechanisms for failed data transfers to ensure data is not lost.
Utilize data backups and snapshots to restore data in case of loss or corruption; a sketch of the validation and retry ideas follows below.
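For illustration, a minimal PySpark sketch combining a retried read with stage-level validation checks; the paths, the order_id column, and the retry settings are assumptions for the example, not a prescribed implementation.

    import logging
    import time

    from pyspark.sql import SparkSession, functions as F

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("etl")

    spark = SparkSession.builder.appName("etl-with-guards").getOrCreate()

    def load_with_retry(path, max_retries=3, backoff_seconds=30):
        # Retry transient read failures so a flaky source does not drop a batch.
        for attempt in range(1, max_retries + 1):
            try:
                return spark.read.parquet(path)
            except Exception as exc:  # catch narrower errors in production
                log.warning("read attempt %d/%d failed: %s", attempt, max_retries, exc)
                if attempt == max_retries:
                    raise
                time.sleep(backoff_seconds)

    df = load_with_retry("s3://example-bucket/raw/orders/")  # hypothetical path

    # Validation checks: refuse to publish an empty or key-less batch so bad
    # data is caught at this stage instead of downstream.
    if df.count() == 0:
        raise ValueError("empty source batch: refusing to overwrite target")
    if df.filter(F.col("order_id").isNull()).count() > 0:
        raise ValueError("rows with null order_id detected")

    df.write.mode("append").parquet("s3://example-bucket/clean/orders/")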
Dataproc clusters offer flexibility and control for complex workloads, while serverless jobs simplify management for straightforward tasks.
Dataproc allows for custom configurations, such as specific versions of Spark or Hadoop, which may be necessary for certain applications.
For large-scale data processing tasks that require fine-tuning of resources, a Dataproc cluster can be more efficient than serverless options; a configuration sketch follows below.
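As a hedged sketch of the "custom versions" point: pinning an image version (and with it the Spark/Hadoop stack) in a Dataproc cluster definition. The project, region, cluster name, and machine types are placeholders, and the dict mirrors the Dataproc ClusterConfig fields rather than a verified production script.

    from google.cloud import dataproc_v1  # assumes google-cloud-dataproc is installed

    region = "us-central1"  # placeholder region
    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    # image_version pins the Spark/Hadoop stack -- the kind of control a
    # serverless batch does not expose.
    cluster = {
        "project_id": "my-project",      # placeholder project
        "cluster_name": "etl-cluster",
        "config": {
            "software_config": {"image_version": "2.1-debian11"},
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
            "worker_config": {"num_instances": 4, "machine_type_uri": "n1-standard-8"},
        },
    }

    operation = client.create_cluster(
        request={"project_id": "my-project", "region": region, "cluster": cluster}
    )
    operation.result()  # block until the cluster is ready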
I applied via a recruitment consultant and was interviewed in Jul 2024. There were 2 interview rounds.
If a job fails in the pipeline and data processing cycle is over, it can lead to incomplete or inaccurate data.
Incomplete data may affect downstream processes and analysis
Data quality may be compromised if errors are not addressed
Monitoring and alerting systems should be in place to detect and handle failures
Re-running the failed job or implementing error handling mechanisms can help prevent issues in the future; a minimal sketch follows below.
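A framework-agnostic sketch of the alert-and-rerun idea; the webhook URL and the run_partition step are hypothetical, and in practice an orchestrator such as Airflow would own retries and alerting.

    import json
    import urllib.request

    def alert(message: str) -> None:
        # Post a failure notification; the webhook endpoint is a placeholder.
        payload = json.dumps({"text": message}).encode("utf-8")
        req = urllib.request.Request(
            "https://hooks.example.com/etl-alerts",
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)

    def run_partition(run_date: str) -> None:
        # Hypothetical job step; writing one date partition per run keeps
        # re-runs idempotent, so replaying a failed date cannot duplicate data.
        ...

    try:
        run_partition("2024-07-01")
    except Exception as exc:
        alert(f"pipeline failed for 2024-07-01: {exc}")
        raise  # surface the failure so the scheduler can re-run this date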
Repartition increases the number of partitions in a DataFrame, while coalesce reduces the number of partitions without shuffling data.
Repartition involves a full shuffle of the data across the cluster, which can be expensive.
Coalesce minimizes data movement by merging existing partitions rather than shuffling every row.
Repartition is typically used when increasing parallelism or evenly distributing data, while coalesce is used for reducing the partition count cheaply, for example before writing output (see the sketch below).
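A small PySpark sketch contrasting the two; the DataFrame and partition counts are arbitrary examples.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partitions-demo").getOrCreate()
    df = spark.range(1_000_000)  # toy DataFrame

    # Full shuffle: redistributes rows evenly across 200 partitions, useful
    # before a wide operation that needs more parallelism.
    wide = df.repartition(200)

    # No full shuffle: merges existing partitions down to 10, typically done
    # just before writing to avoid many small output files.
    narrow = wide.coalesce(10)

    print(wide.rdd.getNumPartitions())    # 200
    print(narrow.rdd.getNumPartitions())  # 10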
SQL code to get the city1-city2 distance from a table with repeating city1 and city2 values.
Use a self join on the table to match city1 and city2
Calculate the distance between the cities using appropriate formula
Consider using a subquery if needed; one possible query is sketched below.
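The bullets suggest a self join; a hedged alternative that avoids the join, assuming a table routes(city1, city2, distance) where the same pair can appear in either order, normalizes each pair with LEAST/GREATEST so DISTINCT collapses the repeats.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dedup-routes").getOrCreate()

    # Assumes `routes` is registered as a table or temp view; the inner
    # SELECT also works unchanged on most relational databases.
    dedup = spark.sql("""
        SELECT DISTINCT
               LEAST(city1, city2)    AS city_a,
               GREATEST(city1, city2) AS city_b,
               distance
        FROM routes
    """)
    dedup.show()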
Data partitioning in a pipeline involves dividing data into smaller chunks for processing and analysis.
Data can be partitioned based on a specific key or attribute, such as date, location, or customer ID.
Partitioning helps distribute data processing tasks across multiple nodes or servers for parallel processing.
Common partitioning techniques include range partitioning, hash partitioning, and list partitioning.
Example: partitioning a sales table by order date so each day's data is processed in parallel; a short illustration follows below.
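A short PySpark illustration; the column names, partition count, and output path are made up for the example.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partitioning-demo").getOrCreate()

    orders = spark.createDataFrame(
        [("2024-07-01", "IN", 120.0), ("2024-07-01", "US", 80.0),
         ("2024-07-02", "IN", 95.0)],
        ["order_date", "country", "amount"],
    )

    # Hash partitioning in memory: rows sharing a key land in the same
    # partition, enabling parallel per-key processing.
    by_key = orders.repartition(8, "country")

    # Directory-style partitioning on disk: one folder per order_date, so
    # date-filtered queries scan only the matching folders.
    orders.write.partitionBy("order_date").mode("overwrite").parquet(
        "/tmp/orders_partitioned"
    )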
I appeared for an interview in Feb 2025.
I was approached by the company and interviewed in Apr 2024. There was 1 interview round.
I have handled terabytes of data in my POCs, including data from various sources and formats.
Handled terabytes of data in POCs
Worked with data from various sources and formats
Used tools like Hadoop, Spark, and SQL for data processing
Repartition is used for increasing partitions for parallelism, while coalesce is used for decreasing partitions to reduce shuffling.
Repartition is used when there is a need for more partitions to increase parallelism.
Coalesce is used when there are too many partitions and they need to be merged without a full shuffle.
Example: Repartition can be used before a join operation to evenly distribute data across partitions for better performance.
Designing/configuring a cluster for 10 petabytes of data involves considerations for storage capacity, processing power, network bandwidth, and fault tolerance.
Consider using a distributed file system like HDFS or object storage like Amazon S3 to store and manage the large volume of data.
Implement a scalable processing framework like Apache Spark or Hadoop to efficiently process and analyze the data in parallel.
Utilize replication for fault tolerance against node and disk failures; a back-of-the-envelope sizing sketch follows below.
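Back-of-the-envelope sizing only; the replication factor, usable-capacity fraction, and per-node disk are assumed values, and a real design would also weigh compute, network bandwidth, and growth.

    raw_data_pb = 10
    replication = 3          # assumed HDFS-style 3x replication
    usable_fraction = 0.70   # assumed headroom for shuffle/temp/OS overhead
    disk_per_node_tb = 96    # assumed dense storage node, e.g. 12 x 8 TB disks

    total_tb = raw_data_pb * 1024 * replication / usable_fraction
    nodes = total_tb / disk_per_node_tb
    print(f"~{total_tb:,.0f} TB raw capacity across ~{nodes:,.0f} storage nodes")
    # ~43,886 TB raw capacity across ~457 storage nodes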
PySpark Coding Test - 2 Questions
VACUUM on Delta tables reclaims storage by deleting data files that are no longer referenced by the table.
Updates, deletes, and overwrites leave obsolete files behind; VACUUM removes those older than the retention threshold (7 days by default).
Cleaning up stale files keeps storage costs down and file listings fast; compacting small files into larger ones is the job of OPTIMIZE, not VACUUM.
VACUUM can be scheduled to run periodically to keep the table healthy.
It is recommended to run VACUUM on Delta tables after large update or delete operations; a short example follows below.
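A hedged example using the delta-spark Python API; the table path and retention window are placeholders.

    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    spark = (
        SparkSession.builder.appName("vacuum-demo")
        .config("spark.sql.extensions",
                "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    table = DeltaTable.forPath(spark, "/data/delta/orders")  # placeholder path

    # Delete files unreferenced for at least 168 hours (the 7-day default);
    # shorter windows risk breaking time travel and in-flight readers.
    table.vacuum(retentionHours=168)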
DENSE_RANK() assigns ranks to rows with ties, ensuring no gaps in ranking values.
DENSE_RANK() is a window function that ranks rows within a partition.
Unlike RANK(), DENSE_RANK() does not skip rank values for ties.
Example: For scores 100, 100, 90, 80, DENSE_RANK() results in 1, 1, 2, 3.
Useful for generating leaderboards or sorting data without gaps; a PySpark version is sketched below.
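The same scores example in PySpark, with RANK() alongside for contrast; the DataFrame is illustrative.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("dense-rank-demo").getOrCreate()

    scores = spark.createDataFrame(
        [("a", 100), ("b", 100), ("c", 90), ("d", 80)], ["player", "score"]
    )

    w = Window.orderBy(F.desc("score"))
    scores.withColumn("rank", F.rank().over(w)) \
          .withColumn("dense_rank", F.dense_rank().over(w)) \
          .show()
    # rank:       1, 1, 3, 4  (gap after the tie)
    # dense_rank: 1, 1, 2, 3  (no gap)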
It was a PySpark coding test.
I applied via LinkedIn and was interviewed in Jun 2024. There was 1 interview round.
PySpark interview questions, including implementing window functions. The coding test had one PySpark question with real-time scenarios requiring several DataFrame operations.
One PySpark question based on a time-series scenario.
I applied via Naukri.com and was interviewed in Apr 2024. There were 2 interview rounds.
SQL and Spark coding test.
I applied via a recruitment consultant and was interviewed in Sep 2023. There were 2 interview rounds.
Use the find command with the -mtime option to find files that are 30 days old in Linux.
Use the find command with the -mtime option to specify the number of days.
For example, to find files that are exactly 30 days old: find /path/to/directory -mtime 30
To find files that are older than 30 days: find /path/to/directory -mtime +30
To find files that are newer than 30 days: find /path/to/directory -mtime -30
Use the COPY command in Redshift to load data from AWS S3.
Use the COPY command in Redshift to load data from S3 bucket.
Specify the IAM role with necessary permissions in the COPY command.
Provide the S3 file path and Redshift table name in the COPY command.
Ensure the Redshift cluster has the necessary permissions to access S3; an end-to-end sketch follows below.
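An end-to-end sketch via psycopg2; the connection details, table, S3 path, and IAM role ARN are all placeholders, not verified values.

    import psycopg2  # assumes network access to the Redshift cluster

    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
        port=5439,
        dbname="analytics",
        user="etl_user",
        password="change-me",  # use a secrets manager in practice
    )

    copy_sql = """
        COPY analytics.events
        FROM 's3://my-bucket/events/2024/07/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS PARQUET;
    """

    # The connection context manager commits on success and rolls back on
    # error; Redshift itself pulls the S3 files in parallel across slices.
    with conn, conn.cursor() as cur:
        cur.execute(copy_sql)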
The duration of the Publicis Sapient Data Engineer interview process can vary, but it typically takes less than 2 weeks to complete.
Senior Associate: 2.2k salaries | ₹16.8 L/yr - ₹32 L/yr
Associate Technology L2: 1.6k salaries | ₹9.1 L/yr - ₹18 L/yr
Senior Associate Technology L1: 1.4k salaries | ₹16.4 L/yr - ₹30 L/yr
Senior Software Engineer: 903 salaries | ₹17.6 L/yr - ₹32 L/yr
Senior Associate 2: 664 salaries | ₹23.8 L/yr - ₹42 L/yr