PySpark Developer
20+ PySpark Developer Interview Questions and Answers
Q1. Tell me about your current project. Difference between managed and external tables. Architecture of Spark. What is RDD. Characteristics of RDD. Meaning of lazy nature. Insert statement for managed and external tables...
Interview questions for a PySpark Developer:
Explained current project and its implementation
Differentiated between managed and external tables
Described Spark architecture and RDD
Discussed characteristics of RDD and lazy nature
Provided insert statements for managed and external tables
Explained code deployment for PySpark jobs
Answered Python-related questions
Explained how to convince manager/scrum master for code changes
Discussed team size and management
Q2. What is the difference between coalesce and repartition, as well as between cache and persist?
Coalesce reduces the number of partitions without shuffling data, while repartition redistributes data by shuffling and can increase or decrease the partition count. Cache and persist both keep an RDD or DataFrame in memory; cache uses the default storage level, while persist lets you specify one.
Coalesce is used to reduce the number of partitions without shuffling data, while repartition is used to increase or decrease the number of partitions by shuffling data.
Coalesce is more efficient when reducing partitions as it avoids shuffling, while repartition involves shuffling data across the network.
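A minimal PySpark sketch of the difference, assuming a local SparkSession (the DataFrame here is just spark.range output for illustration):

    from pyspark.sql import SparkSession
    from pyspark import StorageLevel

    spark = SparkSession.builder.appName("partition-demo").getOrCreate()
    df = spark.range(1_000_000)

    df_wide = df.repartition(8)      # full shuffle into 8 partitions
    df_narrow = df_wide.coalesce(2)  # merges partitions without a shuffle

    df_wide.cache()                  # persists with the default storage level
    df_narrow.persist(StorageLevel.MEMORY_AND_DISK)  # explicit storage level

    df_narrow.count()                # an action materializes the cached data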
Q3. What is the process to orchestrate code in Google Cloud Platform (GCP)?
Orchestrating code in GCP involves using tools like Cloud Composer or Cloud Dataflow to schedule and manage workflows.
Use Cloud Composer to create, schedule, and monitor workflows using Apache Airflow
Utilize Cloud Dataflow for real-time data processing and batch processing tasks
Use Cloud Functions for event-driven serverless functions
Leverage Cloud Scheduler for job scheduling
Integrate with other GCP services like BigQuery, Pub/Sub, and Cloud Storage for data processing and storage.
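As a hedged illustration of the Cloud Composer approach, the sketch below is an Airflow DAG that submits a PySpark job to Dataproc via gcloud; the bucket, cluster, and region names are hypothetical:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="daily_pyspark_pipeline",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        submit_job = BashOperator(
            task_id="submit_pyspark_job",
            bash_command=(
                "gcloud dataproc jobs submit pyspark gs://my-bucket/jobs/etl.py "
                "--cluster=my-cluster --region=us-central1"
            ),
        )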
Q4. What is the SQL code for calculating year-on-year growth percentage with year-wise grouping?
The SQL code for calculating year-on-year growth percentage with year-wise grouping.
Use the LAG function to get the previous year's value
Calculate the growth percentage using the formula: ((current year value - previous year value) / previous year value) * 100
Group by year to get year-wise growth percentage
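A sketch of this query, run here through spark.sql against a hypothetical sales table (year, revenue) so it fits the rest of these PySpark examples:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("yoy-growth").getOrCreate()
    sales = spark.createDataFrame(
        [(2021, 60.0), (2021, 40.0), (2022, 120.0), (2023, 150.0)],
        ["year", "revenue"],
    )
    sales.createOrReplaceTempView("sales")

    spark.sql("""
        WITH yearly AS (
            SELECT year, SUM(revenue) AS revenue
            FROM sales
            GROUP BY year
        )
        SELECT year,
               revenue,
               ROUND((revenue - LAG(revenue) OVER (ORDER BY year))
                     / LAG(revenue) OVER (ORDER BY year) * 100, 2) AS yoy_growth_pct
        FROM yearly
    """).show()

The first year's growth comes back NULL because it has no previous year to compare against.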
Q5. What is the SQL query to find the second highest rank in a dataset?
SQL query to find the second highest rank in a dataset
Use the ORDER BY clause to sort the ranks in descending order
Use the LIMIT and OFFSET clauses to skip the highest rank and retrieve the second highest rank
Example: SELECT DISTINCT rank FROM dataset ORDER BY rank DESC LIMIT 1 OFFSET 1 (DISTINCT guards against duplicate ranks)
Q6. What tools are used to connect Google Cloud Platform (GCP) with Apache Spark?
To connect Google Cloud Platform with Apache Spark, tools like Dataproc, Cloud Storage, and BigQuery can be used.
Use Google Cloud Dataproc to create managed Spark and Hadoop clusters on GCP.
Store data in Google Cloud Storage and access it from Spark applications.
Utilize Google BigQuery for querying and analyzing large datasets directly from Spark.
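A minimal PySpark sketch of that setup, assuming a Dataproc cluster (where the Cloud Storage connector is preinstalled) and the spark-bigquery connector on the classpath; bucket and table names are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("gcp-spark-demo").getOrCreate()

    # gs:// paths work directly through the Cloud Storage connector
    df = spark.read.csv("gs://my-bucket/input/events.csv", header=True, inferSchema=True)

    # write the result to BigQuery via the spark-bigquery connector
    (df.write.format("bigquery")
       .option("table", "my_project.analytics.events")
       .option("temporaryGcsBucket", "my-bucket-tmp")
       .save())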
Q7. What are the optimization techniques used in Apache Spark?
Optimization techniques in Apache Spark improve performance and efficiency.
Partitioning data to distribute work evenly
Caching frequently accessed data in memory
Using broadcast variables for small lookup tables
Optimizing shuffle operations by reducing data movement
Applying predicate pushdown to filter data early
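A short sketch combining a few of these techniques (hypothetical Parquet paths, assuming a running SparkSession):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast, col

    spark = SparkSession.builder.appName("optimization-demo").getOrCreate()

    orders = spark.read.parquet("/data/orders")        # large fact table
    countries = spark.read.parquet("/data/countries")  # small lookup table

    # filter early so less data is shuffled; Parquet also benefits from predicate pushdown
    recent = orders.filter(col("order_date") >= "2024-01-01")

    # broadcast the small table to avoid a shuffle join
    joined = recent.join(broadcast(countries), "country_code")

    # cache a result that several downstream actions will reuse
    joined.cache()
    joined.count()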
Q8. What is the difference between a DataFrame and an RDD (Resilient Distributed Dataset)?
DataFrame is a higher-level abstraction built on top of RDD, providing more structure and optimization capabilities.
DataFrames are distributed collections of data organized into named columns, similar to tables in a relational database.
RDDs are lower-level abstractions representing a collection of objects distributed across a cluster, with no inherent structure.
DataFrames provide optimizations like query optimization and code generation, making them faster for data processing.
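A quick side-by-side sketch (the column names are made up for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-vs-df").getOrCreate()

    # RDD: a plain collection of tuples, accessed positionally, no schema
    rdd = spark.sparkContext.parallelize([("alice", 34), ("bob", 45)])
    adults_rdd = rdd.filter(lambda row: row[1] > 40)

    # DataFrame: named columns, and the filter goes through the Catalyst optimizer
    df = rdd.toDF(["name", "age"])
    adults_df = df.filter(df.age > 40)
    adults_df.show()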
Q9. What is the difference between coalesce and repartition in data processing?
Coalesce reduces the number of partitions without shuffling data, while repartition reshuffles data to create a specific number of partitions.
Coalesce is used to reduce the number of partitions without shuffling data
Repartition is used to increase or decrease the number of partitions by shuffling data
Coalesce is more efficient when reducing partitions as it avoids shuffling
Repartition is useful when you need to explicitly control the number of partitions
Example: coalesce(5) vs ...
Q10. MySQL to GCS: how do you move 10 tables at a time?
Use Apache Sqoop to move 10 tables from MySQL to Google Cloud Storage.
Use Apache Sqoop to import data from MySQL to HDFS
Use Google Cloud Storage connector for Hadoop to move data from HDFS to GCS
Create a script to automate the process for all 10 tables
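Sqoop is one option; as an alternative, a plain PySpark job can loop over the tables with JDBC reads and write each one to GCS. The connection details and table names below are hypothetical, and the MySQL JDBC driver must be on the Spark classpath:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mysql-to-gcs").getOrCreate()

    jdbc_url = "jdbc:mysql://db-host:3306/sales_db"
    tables = ["customers", "orders", "products"]  # ... list all 10 tables here

    for table in tables:
        df = (spark.read.format("jdbc")
              .option("url", jdbc_url)
              .option("dbtable", table)
              .option("user", "etl_user")
              .option("password", "****")
              .load())
        df.write.mode("overwrite").parquet(f"gs://my-bucket/raw/{table}")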
Q11. How do you schedule your PySpark script in a serverless environment?
To schedule a PySpark script in a serverless environment, you can use cloud services like AWS Lambda or Azure Functions.
Use AWS Lambda or Azure Functions to create a serverless function that triggers your PySpark script.
Set up a schedule using cloud services like AWS CloudWatch Events or Azure Scheduler to run the function at specified intervals.
Ensure your PySpark script is optimized for serverless execution to minimize costs and maximize performance.
Q12. SQL coding for rolling averages, lead, and lag
SQL coding for calculating rolling averages and for using the LEAD and LAG functions.
Use window functions like ROWS BETWEEN and ORDER BY for calculating rolling averages.
Use LEAD and LAG functions to access data from previous or next rows.
Example: SELECT col1, AVG(col2) OVER (ORDER BY col1 ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS rolling_avg FROM table_name;
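The same ideas expressed with the PySpark window API, on a small made-up dataset:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("window-demo").getOrCreate()
    df = spark.createDataFrame(
        [("2024-01-01", 10.0), ("2024-01-02", 20.0),
         ("2024-01-03", 30.0), ("2024-01-04", 40.0)],
        ["day", "amount"],
    )

    w = Window.orderBy("day")
    rolling = Window.orderBy("day").rowsBetween(-3, Window.currentRow)

    df.select(
        "day",
        "amount",
        F.avg("amount").over(rolling).alias("rolling_avg"),
        F.lag("amount", 1).over(w).alias("prev_amount"),
        F.lead("amount", 1).over(w).alias("next_amount"),
    ).show()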
Q13. What is the entry point of your pipeline?
The entry point of a pipeline in PySpark is the SparkSession object.
The entry point of a PySpark pipeline is typically created using the SparkSession object.
The SparkSession object is used to create DataFrames, register tables, and execute SQL queries.
Example: spark = SparkSession.builder.appName('example').getOrCreate()
Q14. Explain Spark architecture
Spark architecture is based on a master-slave model with a cluster manager and a distributed file system.
Spark has a driver program that communicates with a cluster manager to coordinate tasks.
The cluster manager allocates resources to worker nodes, which execute tasks in parallel.
Spark uses a distributed file system, such as HDFS, to store and access data across the cluster.
Spark also includes a variety of libraries and APIs for data processing, machine learning, and streaming.
Q15. Azure linked services vs Azure dataset
Azure linked services are connections to external data sources, while Azure datasets are structured data objects within Azure Data Factory.
Azure linked services are used to connect to external data sources such as databases, storage accounts, and SaaS applications.
Azure datasets are structured data objects within Azure Data Factory that represent data from linked services or other sources.
Linked services define the connection information and credentials required to connect to external data sources.
Q16. Actions and transformations in PySpark
Actions and transformations are two types of operations in PySpark. Actions return a value to the driver program, while transformations create a new RDD.
Actions are operations that trigger the execution of the Spark job, such as collect(), count(), and saveAsTextFile().
Transformations are operations that create a new RDD from an existing one, such as map(), filter(), and reduceByKey().
Transformations are lazy, meaning they are not executed until an action is called, while actions trigger immediate execution and return results.
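A tiny sketch of the lazy/eager split, assuming a SparkSession named spark already exists:

    rdd = spark.sparkContext.parallelize(range(10))

    # transformations only build the lineage; nothing executes yet
    squared = rdd.map(lambda x: x * x)
    evens = squared.filter(lambda x: x % 2 == 0)

    # actions trigger the computation and return results to the driver
    print(evens.count())    # 5
    print(evens.collect())  # [0, 4, 16, 36, 64]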
Q17. difference between rdd and dataframe
RDD is a distributed collection of objects, while DataFrame is a distributed collection of data organized into named columns.
RDD is more low-level and requires manual schema definition, while DataFrame provides a higher-level API with schema inference.
DataFrames are optimized for processing structured data, while RDDs are more suitable for unstructured data.
DataFrames support SQL queries and optimizations through the Catalyst optimizer, while RDDs do not have built-in optimizations.
Q18. difference between stage and task
A stage is a collection of tasks that perform a specific computation, while a task is a unit of work that is executed on a single executor.
Stage is a higher-level unit of work that can be broken down into multiple tasks.
Tasks are individual units of work that are executed on a single executor.
Stages are used to organize and coordinate tasks in a Spark job.
Tasks are the actual units of computation that are performed by executors.
Example: A stage may involve reading data from a source and applying a map, with the work split into one task per partition.
Q19. What are RDDs and DataFrames
RDDs and DataFrames are data structures in Apache Spark for processing and analyzing large datasets.
RDDs (Resilient Distributed Datasets) are the fundamental data structure of Spark, representing a collection of elements that can be operated on in parallel.
DataFrames are distributed collections of data organized into named columns, similar to a table in a relational database.
DataFrames are built on top of RDDs, providing a more user-friendly API for structured data processing.
Q20. null handling in spark
Null handling in Spark involves handling missing or null values in data processing.
Use functions like coalesce, na.fill, na.drop to handle null values
Consider using when and otherwise functions for conditional null handling
Be cautious of potential null pointer exceptions when working with null values
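A short sketch of these functions on a made-up DataFrame:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("null-demo").getOrCreate()
    df = spark.createDataFrame(
        [("alice", None), ("bob", 45), (None, 30)],
        "name string, age int",
    )

    df.na.fill({"age": 0, "name": "unknown"}).show()  # replace nulls with defaults
    df.na.drop(subset=["name"]).show()                # drop rows where name is null

    # coalesce picks the first non-null value; when/otherwise handles nulls conditionally
    df.select(
        F.coalesce(F.col("name"), F.lit("n/a")).alias("name"),
        F.when(F.col("age").isNull(), -1).otherwise(F.col("age")).alias("age"),
    ).show()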
Q21. Optimisation in PySpark
Optimisation in PySpark involves improving performance and efficiency of Spark jobs.
Use partitioning to distribute data evenly across nodes
Avoid shuffling data between nodes as much as possible
Use broadcast variables for small lookup tables
Cache intermediate results to avoid recomputation
Optimize transformations and actions for better performance
Q22. Why Spark is used?
Spark is used for big data processing due to its speed, scalability, and ease of use.
Spark is used for processing large volumes of data quickly and efficiently.
It offers in-memory processing which makes it faster than traditional MapReduce.
Spark provides a wide range of libraries for diverse tasks like SQL, streaming, machine learning, and graph processing.
It can run on various platforms like Hadoop, Kubernetes, and standalone clusters.
Spark's ease of use and compatibility with multiple languages and platforms make it a popular choice for big data workloads.
Q23. Transformations vs Actions
Transformations are lazy operations that create new RDDs, while Actions are operations that trigger computation and return results.
Transformations are operations like map, filter, and reduceByKey that create a new RDD from an existing one.
Actions are operations like count, collect, and saveAsTextFile that trigger computation on an RDD and return results.
Transformations are lazy and are only executed when an action is called, allowing for optimization of computations.
Actions are eager: they trigger execution of all pending transformations and return results to the driver.
Q24. Operators in Airflow
Airflow operators are used to define the tasks to be executed in a workflow.
Operators are classes that define the logic to execute a task in Airflow.
There are various types of operators such as BashOperator, PythonOperator, and more.
Operators can be customized to suit specific task requirements.
Operators can be chained together to create complex workflows.
Example: BashOperator executes a bash command, PythonOperator runs a Python function.
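A minimal DAG sketch showing a BashOperator chained to a PythonOperator (the task names and callable are made up):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import PythonOperator

    def report():
        # placeholder callable; a real workflow might validate the load here
        print("load finished")

    with DAG(
        dag_id="operator_demo",
        start_date=datetime(2024, 1, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        extract = BashOperator(task_id="extract", bash_command="echo extracting data")
        notify = PythonOperator(task_id="notify", python_callable=report)
        extract >> notify  # chaining defines the execution order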
Q25. Word count program
A program to count the occurrences of each word in a text document.
Use Spark RDD to read the text file and split the lines into words
Apply transformations like map and reduceByKey to count the occurrences of each word
Handle punctuation and case sensitivity to ensure accurate word count results
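A classic RDD word count, assuming a SparkSession named spark and a hypothetical input path:

    import re

    lines = spark.sparkContext.textFile("/data/article.txt")

    counts = (lines
              .flatMap(lambda line: re.findall(r"[a-z']+", line.lower()))  # lowercase, strip punctuation
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))

    # print the ten most frequent words
    for word, n in counts.takeOrdered(10, key=lambda pair: -pair[1]):
        print(word, n)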