PySpark/Databricks Engineer - Big Data Technologies (2-13 yrs)
Capgemini Engineering
Flexible timing
Job : PySpark/Databricks Engineer
Open for Multiple Locations with WFO and WFH
Job Description :
We are looking for a PySpark solutions developer and data engineer who can design and build solutions for one of our Fortune 500 client programs, which aims to build a standardized, curated data platform on a Hadoop cluster.
This high-visibility, fast-paced key initiative will integrate data across internal and external sources, provide analytical insights, and integrate with the customer's critical systems.
Key Responsibilities :
- Ability to design, build, and unit test applications on the Spark framework in Python.
- Build PySpark-based applications for both batch and streaming requirements, which requires in-depth knowledge of most of the Hadoop ecosystem and of NoSQL databases as well.
- Develop and execute data pipeline testing processes and validate business rules and policies.
- Optimize performance of the built Spark applications in Hadoop using configurations around SparkContext, Spark SQL, DataFrames, and pair RDDs.
- Optimize performance for data access requirements by choosing the appropriate native Hadoop file formats (Avro, Parquet, ORC, etc.) and compression codecs.
- Ability to design and build real-time applications using Apache Kafka and Spark Streaming.
- Build integrated solutions leveraging Unix shell scripting, RDBMS, Hive, HDFS, HDFS file types, and HDFS compression codecs.
- Build data tokenization libraries and integrate them with Hive and Spark for column-level obfuscation.
- Experience in processing large amounts of structured and unstructured data, including integrating data from multiple sources.
- Create and maintain an integration and regression testing framework on Jenkins, integrated with Bitbucket and/or Git repositories.
- Participate in the agile development process, and document and communicate issues and bugs related to data standards in scrum meetings.
- Work collaboratively with onsite and offshore teams.
- Develop and review technical documentation for delivered artifacts.
- Ability to solve complex data-driven scenarios and triage defects and production issues.
- Ability to learn, unlearn, and relearn concepts with an open and analytical mindset.
- Participate in code release and production deployment.
- Challenge and inspire team members to achieve business results in a fast-paced and quickly changing environment.
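The tokenization responsibility above can be illustrated with a minimal sketch in plain Python. The key, function, and column names here are invented for illustration; a real deployment would source the key from a vault and wrap the function as a PySpark UDF applied to Hive-backed DataFrame columns.

```python
import hashlib
import hmac

# Illustrative secret only; a real pipeline would fetch this from a vault/KMS.
SECRET_KEY = b"example-key"

def tokenize(value):
    """Deterministically obfuscate a sensitive value (e.g. an email or SSN).

    Keyed HMAC-SHA256 maps the same input to the same token, preserving
    joinability across tables without exposing the raw data, while the key
    prevents simple rainbow-table reversal.
    """
    if value is None:
        return None
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

# In PySpark this pure function would be registered as a UDF and applied
# per column, e.g. (sketch, assumes a SparkSession and a DataFrame `df`):
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import StringType
#   tokenize_udf = udf(tokenize, StringType())
#   df = df.withColumn("email", tokenize_udf(df["email"]))
```

Determinism is the design point: analysts can still join tokenized columns across tables, but only holders of the key could reproduce a token from a raw value.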
Required Qualifications :
- BE/B.Tech/B.Sc. in Computer Science, Statistics, or Econometrics from an accredited college or university.
- Minimum 3 years of extensive experience in design, build and deployment of PySpark-based applications.
- Expertise in handling complex, large-scale Big Data environments (preferably 20 TB+).
- Minimum 3 years of experience with Hive, YARN, and HDFS, preferably on the Hortonworks Data Platform.
- Good implementation experience of OOP concepts.
- Hands-on experience writing complex SQL queries and exporting and importing large amounts of data using utilities.
- Ability to build abstracted, modularized reusable code components.
- Hands-on experience in generating/parsing XML and JSON documents, and REST API requests/responses.
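The JSON/XML and REST handling listed above can be sketched with Python's standard library alone; the payload shape, field names, and "retry" semantics here are hypothetical, chosen only to show parsing and generation round-trips.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical REST response body; field names are invented for illustration.
raw_response = '{"records": [{"id": 1, "status": "OK"}, {"id": 2, "status": "FAILED"}]}'

def failed_ids(body):
    """Parse a JSON response body and return the ids of failed records."""
    payload = json.loads(body)
    return [r["id"] for r in payload.get("records", []) if r["status"] == "FAILED"]

def retry_request(ids):
    """Generate a JSON request body asking the (hypothetical) API to retry records."""
    return json.dumps({"retry_ids": ids}, sort_keys=True)

# The same records expressed as XML, parsed with the standard library.
xml_doc = "<batch><record id='1' status='OK'/><record id='2' status='FAILED'/></batch>"
statuses = {r.get("id"): r.get("status") for r in ET.fromstring(xml_doc).iter("record")}

print(failed_ids(raw_response))   # → [2]
print(statuses)                   # → {'1': 'OK', '2': 'FAILED'}
```

Using `sort_keys=True` when generating request bodies keeps output deterministic, which simplifies the kind of regression testing the responsibilities above call for.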
Functional Areas: Other