Publicis Sapient
I applied via LinkedIn and was interviewed in Nov 2024. There was 1 interview round.
I applied via Recruitment Consultant and was interviewed in Jul 2024. There were 2 interview rounds.
If a job fails in the pipeline and the data processing cycle is over, it can lead to incomplete or inaccurate data.
Incomplete data may affect downstream processes and analysis
Data quality may be compromised if errors are not addressed
Monitoring and alerting systems should be in place to detect and handle failures
Re-running the failed job or implementing error handling mechanisms can help prevent issues in the future
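The last bullet, re-running a failed job with error handling, can be sketched as a simple retry wrapper. This is a minimal illustration, not part of the original answer; the function and parameter names (`run_with_retries`, `max_retries`, `backoff_seconds`) are hypothetical:

```python
import time

def run_with_retries(job, max_retries=3, backoff_seconds=0):
    """Run a pipeline job, retrying on failure before giving up."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return job()
        except Exception as exc:  # in practice, catch specific transient errors
            last_error = exc
            # Hook for a monitoring/alerting system: log the failed attempt here.
            print(f"Attempt {attempt} failed: {exc}")
            time.sleep(backoff_seconds)
    # Surface the failure loudly so downstream steps are not fed partial data.
    raise RuntimeError(f"Job failed after {max_retries} attempts") from last_error

# Example: a flaky job that succeeds on its third attempt.
calls = {"n": 0}
def flaky_job():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ValueError("transient failure")
    return "done"

print(run_with_retries(flaky_job))  # → done
```

Raising after the retries are exhausted (rather than swallowing the error) is what lets an orchestrator mark the run as failed and alert on it.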
Repartition increases the number of partitions in a DataFrame, while coalesce reduces the number of partitions without shuffling data.
Repartition involves a full shuffle of the data across the cluster, which can be expensive.
Coalesce minimizes data movement by only creating new partitions if necessary.
Repartition is typically used when increasing parallelism or evenly distributing data, while coalesce is used for reducing the number of partitions.
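In PySpark these are `df.repartition(n)` and `df.coalesce(n)`. A pure-Python sketch of why coalesce avoids a shuffle (the list-of-lists partition model here is illustrative, not Spark's actual implementation):

```python
def repartition(partitions, n):
    """Full shuffle: every record is redistributed across n new partitions."""
    records = [r for part in partitions for r in part]  # all data moves
    new_parts = [[] for _ in range(n)]
    for i, r in enumerate(records):
        new_parts[i % n].append(r)  # round-robin placement of individual records
    return new_parts

def coalesce(partitions, n):
    """No shuffle: existing partitions are merged locally into n groups."""
    if n >= len(partitions):
        return partitions  # coalesce never increases the partition count
    new_parts = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        new_parts[i % n].extend(part)  # whole partitions move; records stay together
    return new_parts

parts = [[1], [2, 3], [4], [5, 6]]
print(repartition(parts, 2))  # → [[1, 3, 5], [2, 4, 6]]  (records scattered)
print(coalesce(parts, 2))     # → [[1, 4], [2, 3, 5, 6]]  (partitions merged)
```

Coalesce only merges existing partitions, so data that was co-located stays co-located, which is why it is cheap but can leave partitions unevenly sized.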
SQL code to get the city1 city2 distance of table with repeating city1 and city2 values
Use a self join on the table to match city1 and city2
Calculate the distance between the cities using an appropriate formula
Consider using a subquery if needed
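A common concrete version of this question is to return each city pair once when the table stores both (A, B) and (B, A). A runnable sketch of the self-join using SQLite; the table and column names (`routes`, `city1`, `city2`, `distance`) are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE routes (city1 TEXT, city2 TEXT, distance INTEGER)")
conn.executemany(
    "INSERT INTO routes VALUES (?, ?, ?)",
    [("Delhi", "Mumbai", 1400), ("Mumbai", "Delhi", 1400), ("Pune", "Goa", 450)],
)

# Self join: keep exactly one row per unordered (city1, city2) pair.
rows = conn.execute("""
    SELECT t1.city1, t1.city2, t1.distance
    FROM routes t1
    LEFT JOIN routes t2
      ON t1.city1 = t2.city2 AND t1.city2 = t2.city1
    WHERE t2.city1 IS NULL       -- pair has no reversed duplicate
       OR t1.city1 < t2.city1    -- keep only one direction of the pair
    ORDER BY t1.city1
""").fetchall()
print(rows)  # → [('Delhi', 'Mumbai', 1400), ('Pune', 'Goa', 450)]
```

The `t1.city1 < t2.city1` tie-break is what drops the (Mumbai, Delhi) mirror row while the `IS NULL` branch keeps pairs that have no mirror at all.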
Data partitioning in a pipeline involves dividing data into smaller chunks for processing and analysis.
Data can be partitioned based on a specific key or attribute, such as date, location, or customer ID.
Partitioning helps distribute data processing tasks across multiple nodes or servers for parallel processing.
Common partitioning techniques include range partitioning, hash partitioning, and list partitioning.
Example: ...
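The techniques above can be sketched in pure Python; the record layout and key names (`customer_id`, `amount`) are illustrative, not from the original answer:

```python
def hash_partition(records, key, num_partitions):
    """Assign each record to a partition by hashing its key."""
    parts = [[] for _ in range(num_partitions)]
    for rec in records:
        parts[hash(rec[key]) % num_partitions].append(rec)
    return parts

def range_partition(records, key, boundaries):
    """Assign each record to a partition by comparing its key to sorted boundaries."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for rec in records:
        idx = sum(rec[key] >= b for b in boundaries)  # index of the range the key falls in
        parts[idx].append(rec)
    return parts

orders = [
    {"customer_id": 1, "amount": 50},
    {"customer_id": 2, "amount": 500},
    {"customer_id": 1, "amount": 5000},
]
# Hash partitioning: all rows for a given customer land in the same partition.
by_customer = hash_partition(orders, "customer_id", 2)
# Range partitioning: rows split by amount at boundaries 100 and 1000.
by_amount = range_partition(orders, "amount", [100, 1000])
print([len(p) for p in by_amount])  # → [1, 1, 1]
```

Hash partitioning guarantees co-location of equal keys (useful before joins and aggregations), while range partitioning keeps ordering locality (useful for time- or value-based pruning).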
I applied via Approached by Company and was interviewed in Apr 2024. There was 1 interview round.
I have handled terabytes of data in my POCs, including data from various sources and formats.
Handled terabytes of data in POCs
Worked with data from various sources and formats
Used tools like Hadoop, Spark, and SQL for data processing
Repartition is used for increasing partitions for parallelism, while coalesce is used for decreasing partitions to reduce shuffling.
Repartition is used when there is a need for more partitions to increase parallelism.
Coalesce is used when there are too many partitions and you need to reduce them while avoiding a shuffle.
Example: Repartition can be used before a join operation to evenly distribute data across partitions for better performance.
Designing/configuring a cluster for 10 petabytes of data involves considerations for storage capacity, processing power, network bandwidth, and fault tolerance.
Consider using a distributed file system like HDFS or object storage like Amazon S3 to store and manage the large volume of data.
Implement a scalable processing framework like Apache Spark or Hadoop to efficiently process and analyze the data in parallel.
Utilize...
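A back-of-the-envelope capacity calculation for the 10 PB figure. Replication factor 3 is the HDFS default; the 25% overhead and 100 TB of disk per node are illustrative assumptions, not from the original answer:

```python
usable_data_pb = 10      # logical data to store
replication_factor = 3   # HDFS default block replication
overhead = 1.25          # assumed 25% headroom for temp/shuffle/OS files
disk_per_node_tb = 100   # assumed dense storage node with ~100 TB of disk

raw_needed_pb = usable_data_pb * replication_factor * overhead  # raw capacity needed
nodes = raw_needed_pb * 1000 / disk_per_node_tb                 # PB → TB, then per node
print(raw_needed_pb, int(nodes))  # → 37.5 375
```

The point of the exercise is that 10 PB of logical data implies several times that in raw disk once replication and working space are included, which drives the node count, network, and fault-tolerance design.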
PySpark Coding Test - 2 Questions
VACUUM on Delta tables helps reclaim storage space by deleting data files that are no longer referenced by the table.
It removes stale files left behind by updates, deletes, merges, and compaction once they pass the retention threshold (7 days by default).
Keeping the table directory small reduces storage cost and speeds up file listing; compacting small files for faster scans is handled by the separate OPTIMIZE command.
VACUUM can be scheduled to run periodically to maintain the table.
It is recommended to run VACUUM on Delta tables after m...
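In Delta Lake the operation is typically issued as SQL, e.g. `VACUUM table_name RETAIN 168 HOURS` (168 hours is the default 7-day retention). A minimal helper that builds the statement; the helper name is hypothetical, and actually executing the result requires a Spark session with Delta Lake:

```python
def vacuum_statement(table, retain_hours=168):
    """Build a Delta Lake VACUUM statement.

    168 hours matches Delta's default 7-day retention. In a real job this
    string would be run via spark.sql(vacuum_statement("events")).
    """
    return f"VACUUM {table} RETAIN {retain_hours} HOURS"

print(vacuum_statement("events"))  # → VACUUM events RETAIN 168 HOURS
```

Retaining at least the default window matters: vacuuming with a very short retention can delete files still needed by concurrent readers or time travel.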
It was a coding test on PySpark.
I applied via LinkedIn and was interviewed in Jun 2024. There was 1 interview round.
PySpark interview questions: I was asked to implement a window function. The coding test had one PySpark question with real-time scenarios requiring some operations in PySpark.
1 question on PySpark based on time series
I applied via Naukri.com and was interviewed in Apr 2024. There were 2 interview rounds.
SQL coding test and Spark
I applied via Recruitment Consultant and was interviewed in Sep 2023. There were 2 interview rounds.
Use the find command with the -mtime option to find files that are 30 days old in Linux.
Use the find command with the -mtime option to specify the number of days.
For example, to find files that are exactly 30 days old: find /path/to/directory -mtime 30
To find files that are older than 30 days: find /path/to/directory -mtime +30
To find files that are newer than 30 days: find /path/to/directory -mtime -30
Use the COPY command in Redshift to load data from AWS S3.
Use the COPY command in Redshift to load data from an S3 bucket.
Specify the IAM role with necessary permissions in the COPY command.
Provide the S3 file path and Redshift table name in the COPY command.
Ensure the Redshift cluster has the necessary permissions to access S3.
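The bullets above can be sketched as a statement builder. `COPY`, `FROM`, and `IAM_ROLE` are real Redshift syntax; the helper name and all values in the example (table, bucket path, role ARN, format options) are placeholders:

```python
def copy_statement(table, s3_path, iam_role, options="FORMAT AS CSV IGNOREHEADER 1"):
    """Build a Redshift COPY statement that loads from S3 via an IAM role."""
    return (
        f"COPY {table} "
        f"FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        f"{options}"
    )

sql = copy_statement(
    "public.sales",
    "s3://my-bucket/sales/2024/",
    "arn:aws:iam::123456789012:role/RedshiftLoadRole",
)
print(sql)
```

The statement would then be executed against the cluster with any Redshift client; the IAM role named in `IAM_ROLE` must be attached to the cluster and allowed to read the S3 path.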
Interview process: 2 interview rounds (based on 13 interviews); ratings based on 11 reviews.
| Designation | Salaries reported | Salary range |
| --- | --- | --- |
| Senior Associate | 2.2k | ₹11 L/yr - ₹40 L/yr |
| Associate Technology L2 | 1.5k | ₹6.5 L/yr - ₹20 L/yr |
| Senior Associate Technology L1 | 1.2k | ₹10 L/yr - ₹30 L/yr |
| Senior Software Engineer | 739 | ₹9.5 L/yr - ₹37 L/yr |
| Senior Associate 2 | 622 | ₹14.1 L/yr - ₹41 L/yr |