Sturlite Electric Interview Questions and Answers

Question 1

Q1. What is the difference between coalesce and repartition, as well as between cache and persist?

Answer

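Coalesce reduces the number of partitions and avoids a full shuffle, while repartition performs a full shuffle and can either increase or decrease the number of partitions. Cache and persist both keep a dataset available for reuse: cache() uses the default storage level, while persist() accepts an explicit storage level.

coalesce(n) merges existing partitions, so it is the cheaper choice when reducing the partition count, but it can leave partitions unevenly sized.
repartition(n) redistributes data across the network with a full shuffle, which costs more but produces evenly sized partitions.
cache() is shorthand for persist() with the default storage level (memory only for RDDs, memory and disk for DataFrames); persist() lets you choose a level such as DISK_ONLY.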

Question 2

Asked in

PySpark Developer Interview

Q2. What is the SQL query to find the second highest rank in a dataset?

Answer

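Sort the distinct ranks in descending order, skip the highest, and return the next one.

Use the ORDER BY clause to sort the ranks in descending order
Use DISTINCT so that ties for the highest rank are not counted twice
Use the LIMIT and OFFSET clauses to skip the highest rank and retrieve the second highest rank
Example: SELECT DISTINCT rank FROM dataset ORDER BY rank DESC LIMIT 1 OFFSET 1
Note: RANK is a reserved word in some dialects (for example MySQL 8.0), where the column name must be quoted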
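The effect of ties on the LIMIT/OFFSET approach can be checked with a small sketch. It assumes SQLite (through Python's built-in sqlite3 module) as the engine, and the sample rows are made up for illustration; only the table name `dataset` and column `rank` come from the answer above.

```python
import sqlite3

# In-memory database with illustrative data; the highest rank (30) appears twice.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE dataset ("rank" INTEGER)')
conn.executemany(
    'INSERT INTO dataset ("rank") VALUES (?)',
    [(10,), (30,), (30,), (20,)],
)

# Without DISTINCT, OFFSET 1 skips only one of the tied rows.
plain = conn.execute(
    'SELECT "rank" FROM dataset ORDER BY "rank" DESC LIMIT 1 OFFSET 1'
).fetchone()[0]

# With DISTINCT, duplicates collapse before the offset is applied.
distinct = conn.execute(
    'SELECT DISTINCT "rank" FROM dataset ORDER BY "rank" DESC LIMIT 1 OFFSET 1'
).fetchone()[0]

print(plain, distinct)
```

With this data the query without DISTINCT returns the tied top value again, while the DISTINCT version returns the second highest distinct rank.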

Question 3

Asked in

PySpark Developer Interview

Q3. What is the SQL code for calculating year-on-year growth percentage with year-wise grouping?

Answer

Aggregate the values by year, then compare each year's total with the previous year's total using the LAG window function.

Group by year to get one total per year
Use the LAG function, ordered by year, to get the previous year's value
Calculate the growth percentage using the formula: ((current year value - previous year value) / previous year value) * 100
The first year has no previous value, so its growth percentage is NULL
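One way these steps might fit together is sketched below. The `sales` table, its `sale_year` and `amount` columns, and the sample figures are hypothetical, and the query is run on SQLite (version 3.25 or later is assumed for window function support) through Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and data, for illustration only.
conn.execute("CREATE TABLE sales (sale_year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(2021, 60.0), (2021, 40.0), (2022, 120.0), (2023, 90.0)],
)

query = """
WITH yearly AS (
    -- Step 1: one total per year
    SELECT sale_year, SUM(amount) AS total
    FROM sales
    GROUP BY sale_year
)
SELECT sale_year,
       total,
       -- Steps 2 and 3: previous year's total via LAG, then the growth formula.
       -- NULLIF guards against division by zero on engines that raise an error.
       (total - LAG(total) OVER (ORDER BY sale_year)) * 100.0
           / NULLIF(LAG(total) OVER (ORDER BY sale_year), 0) AS yoy_growth_pct
FROM yearly
ORDER BY sale_year
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The first year comes back with a NULL (None in Python) growth value because LAG has no earlier row to read.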

Question 4

Asked in

PySpark Developer Interview

Q4. What tools are used to connect Google Cloud Platform (GCP) with Apache Spark?

Answer

Spark is usually run on GCP through Dataproc, and it reads and writes GCP data through the Cloud Storage connector and the Spark BigQuery connector.

Use Google Cloud Dataproc to create managed Spark and Hadoop clusters on GCP.
Store data in Google Cloud Storage and access it from Spark applications through gs:// paths, using the Cloud Storage connector (preinstalled on Dataproc).
Use the Spark BigQuery connector to read BigQuery tables into Spark DataFrames and write results back.

Question 5

Asked in

PySpark Developer Interview

Q5. What is the process to orchestrate code in Google Cloud Platform (GCP)?

Answer

Orchestrating code in GCP means scheduling tasks, managing their dependencies, and monitoring runs. Cloud Composer is the main orchestration service; services such as Cloud Dataflow run the work that gets orchestrated.

Use Cloud Composer to create, schedule, and monitor workflows using Apache Airflow
Use Cloud Dataflow to run the streaming and batch data processing tasks that a workflow triggers
Use Cloud Functions for event-driven serverless functions
Use Cloud Scheduler for simple cron-style job scheduling
Integrate with other GCP services like BigQuery, Pub/Sub, and Cloud Storage for data processing and s...

Question 6

Asked in

PySpark Developer Interview

Q6. What are the optimization techniques used in Apache Spark?

Answer

Most Spark optimization techniques work by reducing shuffles, I/O, and repeated computation.

Partitioning data to distribute work evenly
Caching frequently accessed data in memory
Using broadcast variables for small lookup tables
Optimizing shuffle operations by reducing data movement
Applying predicate pushdown to filter data early

Question 7

Asked in

PySpark Developer Interview

Q7. What is the difference between coalesce and repartition in data processing?

Answer

Coalesce merges existing partitions to reduce their number and avoids a full shuffle, while repartition performs a full shuffle to create exactly the requested number of partitions.

Coalesce is used to reduce the number of partitions without a full shuffle
Repartition is used to increase or decrease the number of partitions by shuffling data
Coalesce is more efficient when reducing partitions as it avoids the shuffle, but the resulting partitions can be uneven
Repartition is useful when you need evenly sized partitions or need to increase the partition count
Example: coalesce(5) v...
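The trade-off can be pictured with a toy model in plain Python. This is not the Spark API: `toy_coalesce` and `toy_repartition` are hypothetical helpers that only imitate the two behaviours (merging whole partitions versus redistributing every record), and Spark's actual grouping and distribution logic is more involved.

```python
from typing import List

Partition = List[int]


def toy_coalesce(partitions: List[Partition], n: int) -> List[Partition]:
    """Merge whole neighbouring partitions into n groups (no record-level movement)."""
    if n >= len(partitions):
        # Like coalesce without a shuffle, the toy never increases the count.
        return [list(p) for p in partitions]
    merged: List[Partition] = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i * n // len(partitions)].extend(part)
    return merged


def toy_repartition(partitions: List[Partition], n: int) -> List[Partition]:
    """Redistribute every record round-robin over n new partitions (models a full shuffle)."""
    out: List[Partition] = [[] for _ in range(n)]
    records = [r for part in partitions for r in part]
    for i, record in enumerate(records):
        out[i % n].append(record)
    return out


# Illustrative skewed input: one large partition and three tiny ones.
skewed = [[1, 2, 3, 4, 5, 6], [7], [8], [9]]
coalesced = toy_coalesce(skewed, 2)
repartitioned = toy_repartition(skewed, 2)

print([len(p) for p in coalesced])
print([len(p) for p in repartitioned])
```

In this model the coalesced result stays skewed because the large partition is merged as a unit, while the repartitioned result is balanced at the cost of touching every record.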

Question 8

Asked in

PySpark Developer Interview

Q8. What is the difference between a DataFrame and an RDD (Resilient Distributed Dataset)?

Answer

A DataFrame is a higher-level abstraction built on top of RDDs, adding a schema and query optimization.

DataFrames are distributed collections of data organized into named columns, similar to tables in a relational database.
RDDs are lower-level abstractions representing a collection of objects distributed across a cluster, with no schema attached.
DataFrames benefit from query optimization (the Catalyst optimizer) and code generation (Tungsten), which usually makes them faster for data processing...
