Infosys Pyspark Developer Interview Questions and Answers

Updated 7 Sep 2024

Infosys Pyspark Developer Interview Experiences

1 interview found

Pyspark Developer Interview Questions & Answers

Jeet Dhanesha

posted on 7 Sep 2024

Interview experience: 3 (Average)
Difficulty level: -
Process duration: -
Result: -

Round 1 - Technical

(2 Questions)

  • Q1. Why is Spark used?
  • Ans. 

    Spark is used for big data processing due to its speed, scalability, and ease of use.

    • Spark is used for processing large volumes of data quickly and efficiently.

    • It offers in-memory processing which makes it faster than traditional MapReduce.

    • Spark provides a wide range of libraries for diverse tasks like SQL, streaming, machine learning, and graph processing.

    • It can run on various platforms like Hadoop, Kubernetes, and st...

  • Answered by AI
  • Q2. What are RDDs and DataFrames?
  • Ans. 

    RDDs and DataFrames are data structures in Apache Spark for processing and analyzing large datasets.

    • RDDs (Resilient Distributed Datasets) are the fundamental data structure of Spark, representing a collection of elements that can be operated on in parallel.

    • DataFrames are distributed collections of data organized into named columns, similar to a table in a relational database.

    • DataFrames are built on top of RDDs, providi...

  • Answered by AI

Interview questions from similar companies

Interview experience: 4 (Good)
Difficulty level: Moderate
Process duration: Less than 2 weeks
Result: No response

I applied via Walk-in and was interviewed in Nov 2024. There were 3 interview rounds.

Round 1 - One-on-one 

(2 Questions)

  • Q1. What are the optimization techniques used in Apache Spark?
  • Ans. 

    Optimization techniques in Apache Spark improve performance and efficiency.

    • Partitioning data to distribute work evenly

    • Caching frequently accessed data in memory

    • Using broadcast variables for small lookup tables

    • Optimizing shuffle operations by reducing data movement

    • Applying predicate pushdown to filter data early

  • Answered by AI
  • Q2. What is the difference between coalesce and repartition, as well as between cache and persist?
  • Ans. 

    Coalesce reduces the number of partitions while avoiding a full shuffle, whereas repartition can increase or decrease the number of partitions and always performs a full shuffle. Cache is persist with the default storage level, while persist lets you choose the storage level (memory, disk, or both).

    • Coalesce is used to reduce the number of partitions without a full shuffle, while repartition is used to increase or decrease the number of partitions and always shuffles data.

    • Coalesce is more efficient when reducing partitions as it avoids shuffling...

  • Answered by AI
Round 2 - One-on-one 

(2 Questions)

  • Q1. What is the SQL query to find the second highest rank in a dataset?
  • Ans. 

    SQL query to find the second highest rank in a dataset

    • Use the ORDER BY clause to sort the ranks in descending order

    • Use the LIMIT and OFFSET clauses to skip the highest rank and retrieve the second highest rank

    • Example: SELECT DISTINCT rank FROM dataset ORDER BY rank DESC LIMIT 1 OFFSET 1 (DISTINCT prevents a tied top value from being returned as the second highest)

  • Answered by AI
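
The ORDER BY / LIMIT / OFFSET approach in this answer can be tried out with a minimal sketch using Python's built-in sqlite3 module. The table name `dataset` comes from the answer; the column name `rank_value` and the helper `second_highest` are hypothetical, chosen for illustration because `rank` is a reserved word in some SQL dialects. The sketch assumes that "second highest" means the second-highest distinct value.

```python
import sqlite3


def second_highest(values):
    """Return the second-highest distinct value, or None if there is none."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: one integer column holding the rank values.
    conn.execute("CREATE TABLE dataset (rank_value INTEGER)")
    conn.executemany(
        "INSERT INTO dataset (rank_value) VALUES (?)",
        [(v,) for v in values],
    )
    # Sort descending, skip the highest distinct value, take the next one.
    row = conn.execute(
        "SELECT DISTINCT rank_value FROM dataset "
        "ORDER BY rank_value DESC LIMIT 1 OFFSET 1"
    ).fetchone()
    conn.close()
    return row[0] if row else None


print(second_highest([10, 40, 40, 30]))  # prints 30
```

If ties should count as separate rows instead, dropping DISTINCT returns the second row of the sorted result rather than the second distinct value.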
  • Q2. What is the SQL code for calculating year-on-year growth percentage with year-wise grouping?
  • Ans. 

    Year-on-year growth can be calculated by aggregating the data per year and comparing each year with the previous one.

    • Group by year to get one total per year

    • Use the LAG function, ordered by year, to get the previous year's value

    • Calculate the growth percentage using the formula: ((current year value - previous year value) / previous year value) * 100

  • Answered by AI
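
One possible way to put the LAG-based calculation together is sketched below with Python's built-in sqlite3 module. The `sales` table, its `year` and `amount` columns, and the helper `yoy_growth` are all hypothetical names used only for illustration, and the sketch assumes the bundled SQLite is version 3.25 or newer, since older versions lack window functions.

```python
import sqlite3

# Aggregate to one row per year first, then compare each year with the
# previous one via LAG. The first year has no predecessor, so its growth is NULL.
QUERY = """
WITH yearly AS (
    SELECT year, SUM(amount) AS total
    FROM sales
    GROUP BY year
)
SELECT
    year,
    total,
    ROUND((total - LAG(total) OVER (ORDER BY year)) * 100.0
          / LAG(total) OVER (ORDER BY year), 2) AS yoy_growth_pct
FROM yearly
ORDER BY year
"""


def yoy_growth(rows):
    """rows: iterable of (year, amount) tuples; returns (year, total, pct) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (year INTEGER, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


sample = [(2021, 60.0), (2021, 40.0), (2022, 150.0), (2023, 120.0)]
for year, total, pct in yoy_growth(sample):
    print(year, total, pct)
```

With the sample rows, 2021 has no previous year and so no growth figure, 2022 grows by 50 percent over 2021, and 2023 falls by 20 percent against 2022.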
Round 3 - One-on-one 

(2 Questions)

  • Q1. What tools are used to connect Google Cloud Platform (GCP) with Apache Spark?
  • Ans. 

    To connect Google Cloud Platform with Apache Spark, tools like Dataproc, Cloud Storage, and BigQuery can be used.

    • Use Google Cloud Dataproc to create managed Spark and Hadoop clusters on GCP.

    • Store data in Google Cloud Storage and access it from Spark applications.

    • Utilize Google BigQuery for querying and analyzing large datasets directly from Spark.

  • Answered by AI
  • Q2. What is the process to orchestrate code in Google Cloud Platform (GCP)?
  • Ans. 

    Orchestrating code in GCP typically involves using Cloud Composer (managed Apache Airflow) to schedule and manage workflows, with services such as Cloud Dataflow running the processing steps.

    • Use Cloud Composer to create, schedule, and monitor workflows using Apache Airflow

    • Utilize Cloud Dataflow for real-time data processing and batch processing tasks

    • Use Cloud Functions for event-driven serverless functions

    • Leverage Cloud Scheduler for job scheduling

    • Integrate with other GCP services like BigQ...

  • Answered by AI

Interview Preparation Tips

Topics to prepare for Cognizant Pyspark Developer interview:
  • SQL
  • Spark
  • Python
  • Cloud
Interview preparation tips for other job seekers - It is essential to prepare thoroughly before the interview.

Interview experience: 4 (Good)
Difficulty level: Easy
Process duration: Less than 2 weeks
Result: Not Selected

I was interviewed in Sep 2024.

Round 1 - Coding Test 

Hadoop + Spark MCQ online test

Round 2 - Technical 

(2 Questions)

  • Q1. Spark Architecture
  • Q2. Transformations vs Actions
  • Ans. 

    Transformations are lazy operations that create new RDDs, while Actions are operations that trigger computation and return results.

    • Transformations are operations like map, filter, and reduceByKey that create a new RDD from an existing one.

    • Actions are operations like count, collect, and saveAsTextFile that trigger computation on an RDD and return results.

    • Transformations are lazy and are only executed when an action is c...

  • Answered by AI
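
The lazy-versus-eager distinction described in this answer can be illustrated without a Spark cluster. The following is a plain-Python analogy built on generators, not PySpark code: `lazy_map`, `lazy_filter`, and `collect` are hypothetical stand-ins for a transformation, another transformation, and an action.

```python
log = []  # records when elements are actually processed


def lazy_map(func, items):
    """Transformation-like step: returns a generator, so nothing runs yet."""
    for item in items:
        log.append(f"map {item}")
        yield func(item)


def lazy_filter(pred, items):
    """Another transformation-like step, also lazy."""
    for item in items:
        if pred(item):
            yield item


def collect(items):
    """Action-like step: forces the whole pipeline to execute."""
    return list(items)


# Building the pipeline only describes the work to be done.
pipeline = lazy_filter(lambda x: x % 2 == 0, lazy_map(lambda x: x + 1, [1, 2, 3]))
work_before_action = len(log)  # 0: no element has been processed yet

# The action triggers the computation.
result = collect(pipeline)
print(work_before_action, result, len(log))  # prints: 0 [2, 4] 3
```

Spark goes further than this analogy: it records the transformations as a lineage graph and can optimize the plan before an action runs it, but the "nothing happens until an action is called" behavior is the same idea.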

Interview experience: 4 (Good)
Difficulty level: Moderate
Process duration: Less than 2 weeks
Result: No response

I applied via Naukri.com and was interviewed in Sep 2024. There was 1 interview round.

Round 1 - Coding Test 

1. Find duplicates
2. Find the second and third highest salaries

Interview experience: 4 (Good)
Difficulty level: Moderate
Process duration: 2-4 weeks
Result: No response

I applied via Naukri.com and was interviewed in Jan 2024. There were 2 interview rounds.

Round 1 - Coding Test 

Basic Python coding: lists, dicts, generators, etc.

Round 2 - HR 

(1 Question)

  • Q1. Salary negotiation

Interview Preparation Tips

Topics to prepare for DXC Technology Pyspark Developer interview:
  • Python
  • Spark
  • RDD
  • SQL
Interview preparation tips for other job seekers - Code well

Interview experience: 3 (Average)
Difficulty level: -
Process duration: -
Result: -

Round 1 - Technical

(1 Question)

  • Q1. Conceptual questions

Interview experience: 3 (Average)
Difficulty level: -
Process duration: -
Result: -

Round 1 - Technical

(1 Question)

  • Q1. Basic SQL and Python Questions

Interview experience: 5 (Excellent)
Difficulty level: -
Process duration: -
Result: -

Round 1 - Technical

(2 Questions)

  • Q1. What is the difference between coalesce and repartition in data processing?
  • Ans. 

    Coalesce reduces the number of partitions without shuffling data, while repartition reshuffles data to create a specific number of partitions.

    • Coalesce is used to reduce the number of partitions without shuffling data

    • Repartition is used to increase or decrease the number of partitions by shuffling data

    • Coalesce is more efficient when reducing partitions as it avoids shuffling

    • Repartition is useful when you need to explici...

  • Answered by AI
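
The difference in data movement can be pictured with a toy model in plain Python, where a dataset is a list of partitions and each partition is a list of records. This is only an illustration of the idea, not how Spark implements either operation, and both function bodies are simplified assumptions.

```python
def coalesce(partitions, n):
    """Merge whole partitions into n groups; no partition is ever split."""
    if n >= len(partitions):
        # Like Spark's coalesce, asking for more partitions changes nothing.
        return [list(p) for p in partitions]
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Neighbouring partitions are glued together as they are.
        merged[i * n // len(partitions)].extend(part)
    return merged


def repartition(partitions, n):
    """Redistribute every record round-robin across n new partitions."""
    records = [r for part in partitions for r in part]
    out = [[] for _ in range(n)]
    for i, record in enumerate(records):
        out[i % n].append(record)  # every record may move: a full shuffle
    return out


skewed = [[1, 2, 3, 4, 5, 6], [7], [8], [9]]
print(coalesce(skewed, 2))     # [[1, 2, 3, 4, 5, 6, 7], [8, 9]]
print(repartition(skewed, 2))  # [[1, 3, 5, 7, 9], [2, 4, 6, 8]]
```

In this model coalesce is cheap but keeps the skew of the input, while repartition pays for moving every record and ends up with evenly sized partitions, and it is the only one of the two that can increase the partition count.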
  • Q2. What is the difference between a DataFrame and an RDD (Resilient Distributed Dataset)?
  • Ans. 

    DataFrame is a higher-level abstraction built on top of RDD, providing more structure and optimization capabilities.

    • DataFrames are distributed collections of data organized into named columns, similar to tables in a relational database.

    • RDDs are lower-level abstractions representing a collection of objects distributed across a cluster, with no inherent structure.

    • DataFrames provide optimizations like query optimization a...

  • Answered by AI

Interview experience: 5 (Excellent)
Difficulty level: -
Process duration: -
Result: -

Round 1 - Technical

(4 Questions)

  • Q1. Spark architecture
  • Q2. Word count program
  • Ans. 

    A program to count the occurrences of each word in a text document.

    • Use Spark RDD to read the text file and split the lines into words

    • Apply transformations like map and reduceByKey to count the occurrences of each word

    • Handle punctuation and case sensitivity to ensure accurate word count results

  • Answered by AI
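
The steps in this answer can be sketched in plain Python so that they run without a Spark installation. The function below is a hypothetical stand-in that mirrors the shape of the usual RDD pipeline (in PySpark, typically `flatMap` to split lines, `map` to build `(word, 1)` pairs, and `reduceByKey` to sum them); the regular expression used to strip punctuation is an assumption about what counts as a word.

```python
import re
from collections import defaultdict


def word_count(lines):
    """Count word occurrences, ignoring case and punctuation."""
    # flatMap-like step: lower-case each line and split it into words.
    words = (w for line in lines for w in re.findall(r"[a-z0-9']+", line.lower()))
    # map-like step: pair every word with a count of 1.
    pairs = ((w, 1) for w in words)
    # reduceByKey-like step: sum the counts for each distinct word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


print(word_count(["Spark is fast.", "Is Spark easy? Spark is!"]))
# prints {'spark': 3, 'is': 3, 'fast': 1, 'easy': 1}
```

Lower-casing before splitting is what makes "Is" and "is" count as the same word, which is the case-sensitivity point raised in the answer.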
  • Q3. Azure integration services
  • Q4. Azure linked services vs Azure dataset
  • Ans. 

    Azure linked services are connections to external data sources, while Azure datasets are structured data objects within Azure Data Factory.

    • Azure linked services are used to connect to external data sources such as databases, storage accounts, and SaaS applications.

    • Azure datasets are structured data objects within Azure Data Factory that represent data from linked services or other sources.

    • Linked services define the con...

  • Answered by AI

Interview experience: 5 (Excellent)
Difficulty level: -
Process duration: -
Result: No response

Round 1 - Coding Test

Basic to moderate SQL questions

Interview Preparation Tips

Topics to prepare for Capgemini Pyspark Developer interview:
  • SQL
  • Spark

Infosys Interview FAQs

How many rounds are there in Infosys Pyspark Developer interview?
The Infosys interview process usually has 1 round. The most common round in the Infosys interview process is Technical.
How to prepare for Infosys Pyspark Developer interview?
Go through your CV in detail and study all the technologies mentioned in your CV. Prepare at least two technologies or languages in depth if you are appearing for a technical interview at Infosys. The most common topics and skills that interviewers at Infosys expect are Python, Big Data, Pyspark, Spark and Software Quality Assurance.
What are the top questions asked in Infosys Pyspark Developer interview?

Some of the top questions asked at the Infosys Pyspark Developer interview:

  1. What are RDDs and DataFrames?
  2. Why is Spark used?

Infosys Pyspark Developer Salary
based on 12 salaries
₹4.5 L/yr - ₹12.2 L/yr
At par with the average Pyspark Developer Salary in India

Infosys Pyspark Developer Reviews and Ratings

Overall: 5.0/5 (based on 1 review)

Rating in categories
  • Skill development: 3.0
  • Work-life balance: 5.0
  • Salary: 5.0
  • Job security: 5.0
  • Company culture: 5.0
  • Promotions: 5.0
  • Work satisfaction: 5.0
