10+ MILEAGE LOGISTICS Interview Questions and Answers
Q1. What happens if a job fails in the pipeline and the data processing cycle is over?
If a job fails in the pipeline and the processing cycle has already closed, the cycle can end with incomplete or inaccurate data.
Incomplete data may affect downstream processes and analysis
Data quality may be compromised if errors are not addressed
Monitoring and alerting systems should be in place to detect and handle failures
Re-running the failed job or implementing retry and error-handling mechanisms can help prevent issues in the future, as in the sketch below
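A minimal Python sketch of such a retry mechanism; the job function, attempt count, and backoff values are illustrative assumptions, not part of the original answer.

import time

def run_job():
    # Hypothetical pipeline step; assume it raises an exception on failure.
    pass

def run_with_retries(job, attempts=3, backoff_seconds=60):
    for attempt in range(1, attempts + 1):
        try:
            return job()
        except Exception as exc:
            print(f"Attempt {attempt} failed: {exc}")  # stand-in for real alerting
            if attempt == attempts:
                raise  # surface the failure so the cycle is not silently incomplete
            time.sleep(backoff_seconds * attempt)

run_with_retries(run_job)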
Q2. What volume of data have you handled in your POCs?
I have handled terabytes of data in my POCs, including data from various sources and formats.
Handled terabytes of data in POCs
Worked with data from various sources and formats
Used tools like Hadoop, Spark, and SQL for data processing
Q3. Write SQL code to get the city1/city2 distance from a table where city1 and city2 values can repeat
SQL code to get the city1/city2 distance when city1 and city2 values repeat in the table
Use a self-join or grouping on the table to match city1 and city2, as shown in the sketch below
Calculate the distance between the cities using an appropriate formula
Consider using a subquery if needed
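A hedged sketch in PySpark; the table name city_pairs and its columns (city1, city2, distance) are assumptions, since the original schema is not given. It deduplicates repeated pairs by normalizing the column order before grouping.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("city-distance").getOrCreate()

# Assumed sample data: pairs may repeat, possibly with the cities swapped.
rows = [("Pune", "Mumbai", 150.0), ("Mumbai", "Pune", 150.0), ("Pune", "Delhi", 1450.0)]
spark.createDataFrame(rows, ["city1", "city2", "distance"]).createOrReplaceTempView("city_pairs")

spark.sql("""
    SELECT LEAST(city1, city2)    AS city_a,
           GREATEST(city1, city2) AS city_b,
           MIN(distance)          AS distance
    FROM city_pairs
    GROUP BY LEAST(city1, city2), GREATEST(city1, city2)
""").show()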
Q4. What is the difference between repartition and coalesce?
Repartition increases the number of partitions in a DataFrame, while coalesce reduces the number of partitions without shuffling data.
Repartition involves a full shuffle of the data across the cluster, which can be expensive.
Coalesce minimizes data movement by merging existing partitions rather than shuffling all records across the cluster.
Repartition is typically used when increasing parallelism or evenly distributing data, while coalesce is used for reducing the number of partitions without a full shuffle.
Example: see the PySpark sketch below.
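A minimal PySpark sketch of the difference; the DataFrame and partition counts are illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitions").getOrCreate()
df = spark.range(1_000_000)

wide = df.repartition(200)   # full shuffle: records are redistributed evenly
narrow = wide.coalesce(10)   # merges existing partitions, no full shuffle

print(wide.rdd.getNumPartitions())    # 200
print(narrow.rdd.getNumPartitions())  # 10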
Q5. How would you design/configure a cluster if you were given 10 petabytes of data?
Designing/configuring a cluster for 10 petabytes of data involves considerations for storage capacity, processing power, network bandwidth, and fault tolerance.
Consider using a distributed file system like HDFS or object storage like Amazon S3 to store and manage the large volume of data.
Implement a scalable processing framework like Apache Spark or Hadoop to efficiently process and analyze the data in parallel.
Utilize a cluster management system like Apache Mesos or Kubernetes to allocate resources and recover from node failures; a rough sizing sketch follows.
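A back-of-the-envelope sizing sketch in Python; the replication factor, overhead, and per-node disk figures are assumptions for illustration only.

raw_data_tb = 10 * 1024          # 10 PB expressed in TB
replication_factor = 3           # typical HDFS default
overhead = 1.25                  # ~25% headroom for shuffle/temp data
disk_per_node_tb = 48            # e.g. 12 x 4 TB drives per data node

required_tb = raw_data_tb * replication_factor * overhead
nodes = -(-required_tb // disk_per_node_tb)  # ceiling division
print(f"Roughly {int(nodes)} data nodes for {required_tb:,.0f} TB of raw capacity")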
Q6. When would you decide to use repartition versus coalesce?
Repartition is used for increasing partitions for parallelism, while coalesce is used for decreasing partitions to reduce shuffling.
Repartition is used when there is a need for more partitions to increase parallelism.
Coalesce is used when there are too many partitions and you need to reduce them while avoiding a full shuffle.
Example: Repartition can be used before a join operation to evenly distribute data across partitions for better performance.
Example: Coalesce can be used after a selective filter to reduce the number of mostly-empty partitions before writing output, as in the sketch below.
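A hedged sketch of both decision points; the DataFrames, join key, and output path are illustrative assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("when-to-use").getOrCreate()
orders = spark.createDataFrame([(1, 100.0), (2, 200.0)], ["cust_id", "amount"])
customers = spark.createDataFrame([(1, "A"), (2, "B")], ["cust_id", "name"])

# Before a join: repartition on the join key so matching rows co-locate.
joined = orders.repartition("cust_id").join(customers, "cust_id")

# After a selective filter: coalesce to avoid writing many tiny files.
joined.filter("amount > 150").coalesce(1).write.mode("overwrite").parquet("/tmp/big_orders")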
Q7. How is data partitioned in a pipeline?
Data partitioning in a pipeline involves dividing data into smaller chunks for processing and analysis.
Data can be partitioned based on a specific key or attribute, such as date, location, or customer ID.
Partitioning helps distribute data processing tasks across multiple nodes or servers for parallel processing.
Common partitioning techniques include range partitioning, hash partitioning, and list partitioning.
Example: Partitioning sales data by region to analyze sales performance per region, as sketched below.
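A short PySpark sketch of the sales-by-region example; the column names and output path are assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning").getOrCreate()
sales = spark.createDataFrame(
    [("2024-01-01", "EMEA", 120.0), ("2024-01-01", "APAC", 80.0)],
    ["sale_date", "region", "amount"],
)

# Hash-partition by region for parallel processing, then write
# directory-partitioned output so each region can be read independently.
sales.repartition("region").write.mode("overwrite").partitionBy("region").parquet("/tmp/sales")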
Q8. Command to find files that are 30 days old in Linux
Use the find command with the -mtime option to find files that are 30 days old in Linux.
The -mtime option counts modification time in 24-hour periods, so -mtime 30 matches files last modified between 30 and 31 days ago.
For example, to find files that are exactly 30 days old: find /path/to/directory -mtime 30
To find files that are older than 30 days: find /path/to/directory -mtime +30
To find files that are newer than 30 days: find /path/to/directory -mtime -30
Q9. 1) What are transformations and actions in Spark? 2) How do you reduce shuffling? 3) Questions related to the project
Transformations and actions in Spark, reducing shuffling, and project-related questions.
Transformations in Spark are operations that create a new RDD from an existing one, while actions are operations that return a value to the driver program.
Examples of transformations include map, filter, and reduceByKey, while examples of actions include count, collect, and saveAsTextFile.
To reduce shuffling in Spark, you can use techniques like partitioning, caching, and appropriate join strategies such as broadcasting small tables; a sketch of the transformation/action distinction follows.
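A minimal sketch showing that transformations are lazy and actions trigger execution; the data is illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy").getOrCreate()
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])

doubled = rdd.map(lambda x: x * 2)                        # transformation: nothing runs yet
multiples_of_four = doubled.filter(lambda x: x % 4 == 0)  # still lazy

print(multiples_of_four.count())    # action: triggers the job (prints 2)
print(multiples_of_four.collect())  # action: returns [4, 8] to the driver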
Q10. Use of VACUUM in Delta tables in terms of performance
VACUUM in Delta tables helps performance and storage hygiene by reclaiming space: it deletes data files that are no longer referenced by the table.
Removing stale files keeps the table directory lean; note that compacting small files into larger ones is handled by the separate OPTIMIZE command, not by VACUUM.
It reduces storage costs and file-listing overhead, since obsolete files no longer accumulate alongside the live data.
VACUUM can be scheduled to run periodically to maintain optimal performance.
It is recommended to run VACUUM on Delta tables after major data deletions or updates.
Example: VACUUM delta.`/path/to/table` RETAIN 168 HOURS, as in the sketch below (the path is a placeholder).
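A hedged sketch, assuming a Delta-enabled Spark session (the delta-spark package on the classpath) and a placeholder table path.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("vacuum-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Delete files no longer referenced by the table and older than 7 days (168 hours).
spark.sql("VACUUM delta.`/data/events_table` RETAIN 168 HOURS")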
Q11. Command to copy data from AWS S3 to Redshift
Use the COPY command in Redshift to load data from AWS S3.
Use the COPY command in Redshift to load data from an S3 bucket.
Specify the IAM role with necessary permissions in the COPY command.
Provide the S3 file path and Redshift table name in the COPY command.
Ensure the Redshift cluster has the necessary permissions to access S3; a sketch follows.
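A hedged sketch using Python and psycopg2; the cluster endpoint, credentials, bucket path, IAM role ARN, and table name are all placeholders.

import psycopg2

conn = psycopg2.connect(
    host="my-cluster.example.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="loader", password="<password>",
)
copy_sql = """
    COPY sales
    FROM 's3://my-bucket/sales/2024/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS CSV
    IGNOREHEADER 1;
"""
# Runs the COPY inside a transaction; psycopg2 commits on clean exit.
with conn, conn.cursor() as cur:
    cur.execute(copy_sql)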