Cognizant Pyspark Developer Interview Questions and Answers

Updated 30 Dec 2024

Q1. What is the difference between coalesce and repartition, as well as between cache and persist?

Ans.

Coalesce reduces the number of partitions without a full shuffle, while repartition can increase or decrease the number of partitions and always shuffles data. Cache stores data at the default storage level, while persist lets you choose the storage level (memory, disk, or both).

  • Coalesce is used to reduce the number of partitions without shuffling data, while repartition is used to increase or decrease the number of partitions by shuffling data.

  • Coalesce is more efficient when reducing partitions as it avoids shuffling, while repartition involves shuffling data across the net...

Q2. What is the SQL query to find the second highest rank in a dataset?

Ans.

Sort the ranks in descending order, skip the highest value, and return the next one.

  • Use the ORDER BY clause to sort the ranks in descending order

  • Use the LIMIT and OFFSET clauses to skip the highest rank and retrieve the second highest rank

  • Example: SELECT DISTINCT rank FROM dataset ORDER BY rank DESC LIMIT 1 OFFSET 1 (DISTINCT ensures that a tied top value is not returned as the second highest)
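
A plain ORDER BY ... LIMIT 1 OFFSET 1 returns the second row, which repeats the top value when that value is tied. The sketch below compares the query with and without DISTINCT using Python's built-in sqlite3 module. The sample rows are made up for illustration, and the column name is quoted because rank is a reserved word in some SQL dialects.

```python
import sqlite3

# Hypothetical sample data; table and column names follow the example query.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE dataset (name TEXT, "rank" INTEGER)')
conn.executemany(
    "INSERT INTO dataset VALUES (?, ?)",
    [("a", 10), ("b", 30), ("c", 30), ("d", 20)],  # the top value 30 is tied
)

NAIVE = 'SELECT "rank" FROM dataset ORDER BY "rank" DESC LIMIT 1 OFFSET 1'
DISTINCT = 'SELECT DISTINCT "rank" FROM dataset ORDER BY "rank" DESC LIMIT 1 OFFSET 1'


def first_value(query):
    """Run the query and return its single value, or None if there is no row."""
    row = conn.execute(query).fetchone()
    return row[0] if row else None


naive_result = first_value(NAIVE)        # second row, which is the tied top value
distinct_result = first_value(DISTINCT)  # second distinct value
```

If the table holds fewer than two distinct ranks, the query returns no row and first_value returns None.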

Q3. What is the SQL code for calculating year-on-year growth percentage with year-wise grouping?

Ans.

Aggregate the values by year, then compare each year's total with the previous year's total.

  • Group by year to get one total per year

  • Use the LAG function, ordered by year, to get the previous year's total

  • Calculate the growth percentage using the formula: ((current year value - previous year value) / previous year value) * 100
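
The steps above can be sketched as one query, run here through Python's built-in sqlite3 module (window functions need SQLite 3.25 or later). The sales table, its columns, and the figures are hypothetical, and other SQL dialects may need small syntax changes.

```python
import sqlite3

# Hypothetical table: one row per sale, with the year already extracted.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(2021, 100.0), (2021, 100.0), (2022, 250.0), (2023, 300.0)],
)

YOY_QUERY = """
WITH yearly AS (
    -- Step 1: one total per year
    SELECT year, SUM(amount) AS total
    FROM sales
    GROUP BY year
)
SELECT year,
       total,
       -- Steps 2 and 3: previous year's total via LAG, then the percentage
       ROUND((total - LAG(total) OVER (ORDER BY year)) * 100.0
             / LAG(total) OVER (ORDER BY year), 2) AS yoy_growth_pct
FROM yearly
ORDER BY year
"""

rows = conn.execute(YOY_QUERY).fetchall()
for year, total, growth in rows:
    print(year, total, growth)
```

The first year has no previous value, so LAG yields NULL and the growth column is empty for that row.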

Q4. What tools are used to connect Google Cloud Platform (GCP) with Apache Spark?

Ans.

To connect Google Cloud Platform with Apache Spark, tools like Dataproc, Cloud Storage, and BigQuery can be used.

  • Use Google Cloud Dataproc to create managed Spark and Hadoop clusters on GCP.

  • Store data in Google Cloud Storage and read it from Spark applications through the Cloud Storage connector (gs:// paths).

  • Use the Spark BigQuery connector to read from and write to BigQuery tables directly from Spark.

Q5. What is the process to orchestrate code in Google Cloud Platform (GCP)?

Ans.

Orchestrating code in GCP usually means using Cloud Composer (managed Apache Airflow) to schedule and manage workflows, with services such as Dataflow, Cloud Functions, and Cloud Scheduler covering processing, event triggers, and simple scheduling.

  • Use Cloud Composer to create, schedule, and monitor workflows using Apache Airflow

  • Use Cloud Dataflow for the streaming and batch processing steps of a workflow

  • Use Cloud Functions for event-driven serverless functions

  • Use Cloud Scheduler for simple time-based job scheduling

  • Integrate with other GCP services like BigQuery, Pub/Sub, and Cloud Storage for data processing and s...
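
Airflow, which Cloud Composer manages, describes a workflow as a graph of tasks and runs each task only after the tasks it depends on. The sketch below models that idea with Python's standard graphlib module (Python 3.9 or later). It is not Airflow or Composer code, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract_from_gcs": set(),
    "transform_with_spark": {"extract_from_gcs"},
    "load_to_bigquery": {"transform_with_spark"},
}


def run_pipeline(dependencies, actions):
    """Run every task after its dependencies and return the execution order."""
    executed = []
    for task in TopologicalSorter(dependencies).static_order():
        actions[task]()  # a real orchestrator would also retry and log here
        executed.append(task)
    return executed


log = []
# Placeholder actions that only record that the task ran.
actions = {name: (lambda name=name: log.append(f"ran {name}")) for name in dependencies}
order = run_pipeline(dependencies, actions)
print(order)
```

A real orchestrator adds scheduling, retries, and monitoring on top of this ordering.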

Q6. What are the optimization techniques used in Apache Spark?

Ans.

Optimization techniques in Apache Spark mainly reduce data movement (shuffles), repeated computation, and the amount of data read.

  • Partitioning data to distribute work evenly

  • Caching frequently accessed data in memory

  • Using broadcast variables for small lookup tables

  • Optimizing shuffle operations by reducing data movement

  • Applying predicate pushdown to filter data early
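
To illustrate the broadcast technique from the list above, here is a toy model in plain Python rather than Spark code. The small lookup table is copied to every partition of the large dataset, so the large side is joined in place and never shuffled. The data and function name are hypothetical.

```python
def broadcast_join(large_partitions, small_table):
    """Inner-join each partition of the large side against a broadcast lookup."""
    lookup = dict(small_table)  # the "broadcast" copy available to every partition
    joined = []
    for partition in large_partitions:
        # Each partition is joined locally; no record leaves its partition.
        joined.append(
            [(key, value, lookup[key]) for key, value in partition if key in lookup]
        )
    return joined


# Hypothetical data: orders keyed by country code, split over two partitions.
orders = [
    [("IN", 100), ("US", 250)],
    [("IN", 75), ("FR", 20)],
]
countries = [("IN", "India"), ("US", "United States")]

result = broadcast_join(orders, countries)
print(result)
```

In PySpark the same intent is expressed by wrapping the small DataFrame in the broadcast function from pyspark.sql.functions before joining.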

Q7. What is the difference between coalesce and repartition in data processing?

Ans.

Coalesce reduces the number of partitions without shuffling data, while repartition reshuffles data to create a specific number of partitions.

  • Coalesce is used to reduce the number of partitions without shuffling data

  • Repartition is used to increase or decrease the number of partitions by shuffling data

  • Coalesce is more efficient when reducing partitions as it avoids shuffling

  • Repartition is useful when you need to explicitly control the number of partitions

  • Example: coalesce(5) v...
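
The difference can be modelled in plain Python. This is a toy sketch, not Spark's implementation: coalesce concatenates whole existing partitions, so records stay grouped and skew can remain, while repartition redistributes every record round-robin, which costs a full shuffle but balances sizes. Spark's actual grouping also takes data locality into account.

```python
def coalesce(partitions, n):
    """Merge neighbouring partitions into at most n groups without moving single records."""
    if n >= len(partitions):
        # Without a shuffle the partition count cannot grow.
        return [list(p) for p in partitions]
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i * n // len(partitions)].extend(part)
    return merged


def repartition(partitions, n):
    """Full shuffle: deal every record out round-robin into n new partitions."""
    out = [[] for _ in range(n)]
    i = 0
    for part in partitions:
        for record in part:
            out[i % n].append(record)
            i += 1
    return out


# Hypothetical skewed input: one large partition and three tiny ones.
parts = [[1, 2, 3, 4, 5, 6], [7], [8], [9]]

print([len(p) for p in coalesce(parts, 2)])     # sizes after merging
print([len(p) for p in repartition(parts, 2)])  # sizes after a full shuffle
```

In this model coalesce leaves the data skewed while repartition evens it out, which is the usual reason to accept the shuffle cost.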

Q8. What is the difference between a DataFrame and an RDD (Resilient Distributed Dataset)?

Ans.

DataFrame is a higher-level abstraction built on top of RDD, providing more structure and optimization capabilities.

  • DataFrames are distributed collections of data organized into named columns, similar to tables in a relational database.

  • RDDs are lower-level abstractions representing a collection of objects distributed across a cluster, with no inherent structure.

  • DataFrames provide optimizations like query optimization and code generation, making them faster for data processing...


Interview Process at Cognizant Pyspark Developer

Interview experience: 4.5 (Good), based on 2 interviews