Recro - Big Data Developer - Hadoop/PySpark (5-7 yrs)
Recro
We are seeking a Big Data Engineer (Python & PySpark) who will be responsible for developing and optimizing data pipelines in Python and PySpark, and for handling large datasets in distributed systems.
The ideal candidate will have hands-on experience with Apache Spark, Hadoop, Hive, Kafka, and cloud-based solutions, particularly on Google Cloud Platform (GCP).
This position requires someone who is highly technical, can design scalable data architectures, and is capable of performance tuning to meet business requirements.
The successful candidate will also build cloud-based solutions, perform data transformation, and work closely with other teams to optimize big data applications.
Key Responsibilities:
- Design and develop scalable data pipelines using Python and PySpark to handle large volumes of structured and unstructured data.
- Integrate data from diverse sources into data processing workflows to ensure data availability for analytics and reporting.
- Build and optimize Apache Spark jobs for data transformation, aggregation, and processing at scale (see the batch pipeline sketch after this list).
- Tune the performance of Spark jobs and manage resource allocation to ensure efficient processing across large datasets.
- Work with Big Data tools including Hadoop, HDFS, Hive, and Kafka to manage and process large datasets in distributed systems.
- Utilize Kafka for stream processing and ensure that data pipelines handle both batch and real-time data efficiently (see the streaming sketch after this list).
- Implement cloud-based big data solutions using Google Cloud Platform (GCP), particularly Google Cloud Dataproc, BigQuery, and other relevant GCP services.
- Optimize cloud data storage, processing, and computing resources to ensure cost-effective scaling.
- Conduct performance tuning for Big Data applications and ensure the scalability and reliability of data systems.
- Troubleshoot issues related to the performance, efficiency, and quality of the data pipelines and applications.
- Work closely with Data Scientists, Data Analysts, and other engineers to ensure seamless integration of big data applications.
- Collaborate with cross-functional teams to understand business requirements and translate them into scalable data solutions.
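To make the batch-pipeline responsibility concrete, here is a minimal PySpark sketch of the read-transform-aggregate-write cycle described above. The bucket paths, column names, and job name are hypothetical placeholders, not details from this role:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical pipeline: all paths and column names are illustrative only.
spark = (
    SparkSession.builder
    .appName("events-daily-aggregation")
    .getOrCreate()
)

# Ingest raw structured data (e.g. Parquet landed in Cloud Storage).
events = spark.read.parquet("gs://example-bucket/raw/events/")

# Transform: drop bad records and derive a date column.
cleaned = (
    events
    .filter(F.col("event_type").isNotNull())
    .withColumn("event_date", F.to_date("event_ts"))
)

# Aggregate at scale: daily counts per event type.
daily_counts = (
    cleaned
    .groupBy("event_date", "event_type")
    .agg(F.count("*").alias("event_count"))
)

# Write partitioned output for downstream analytics and reporting.
(
    daily_counts.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("gs://example-bucket/curated/daily_event_counts/")
)
```

For the real-time side of the Kafka responsibility, a sketch of a Spark Structured Streaming job reading from a Kafka topic follows; the broker address, topic, and checkpoint path are assumptions, and the job requires the spark-sql-kafka connector on the classpath:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical streaming job: broker, topic, and paths are placeholders.
spark = SparkSession.builder.appName("kafka-stream-ingest").getOrCreate()

# Read a Kafka topic as a streaming DataFrame.
stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "events")
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as binary; cast the payload to string for parsing.
parsed = stream.select(
    F.col("key").cast("string"),
    F.col("value").cast("string").alias("payload"),
    F.col("timestamp"),
)

# Sink to cloud storage with checkpointing so the pipeline can recover.
query = (
    parsed.writeStream
    .format("parquet")
    .option("path", "gs://example-bucket/streaming/events/")
    .option("checkpointLocation", "gs://example-bucket/checkpoints/events/")
    .trigger(processingTime="1 minute")
    .start()
)

query.awaitTermination()
```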
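The same batch/stream split applies on GCP: the batch job above maps naturally onto Dataproc with curated output queryable from BigQuery, while the streaming job lands data continuously in Cloud Storage.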
Skills & Qualifications:
Technical Skills:
- Strong hands-on experience with Python for building data processing pipelines and PySpark for working with distributed data systems.
- Proficiency in using PySpark RDDs and DataFrames to perform large-scale data transformations and aggregations.
- In-depth knowledge of Apache Spark (RDDs, DataFrames, tuning, etc.) for distributed data processing.
- Strong experience with Hadoop, HDFS, Hive, and Kafka for managing and processing large datasets in a distributed environment.
- Hands-on experience with Google Cloud Platform (GCP) services, including Google Cloud Dataproc, BigQuery, and Cloud Storage.
- Familiarity with cloud-based data infrastructure, data storage, and processing solutions for large-scale applications.
- Strong background in handling large-scale, distributed systems, ensuring the reliable processing of massive datasets across clusters.
- Knowledge of partitioning, shuffling, and data storage strategies to optimize distributed data jobs (a tuning sketch follows this list).
- Experience in performance tuning for distributed data applications, ensuring efficiency in both batch and stream processing jobs.
- Familiarity with optimizing resource usage in cloud environments to reduce processing time and cost.
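As one concrete instance of the partitioning and tuning skills listed above, a sketch of common Spark knobs follows. The numeric values and the input path are illustrative starting points only; the right settings depend on cluster size and data volume:

```python
from pyspark.sql import SparkSession

# Illustrative tuning knobs; values must be sized to the actual cluster.
spark = (
    SparkSession.builder
    .appName("tuned-job")
    # Align shuffle parallelism with cluster cores to avoid tiny or huge tasks.
    .config("spark.sql.shuffle.partitions", "400")
    # Let Spark coalesce shuffle partitions at runtime (Spark 3.x AQE).
    .config("spark.sql.adaptive.enabled", "true")
    .getOrCreate()
)

df = spark.read.parquet("gs://example-bucket/curated/daily_event_counts/")

# Repartition by the aggregation key to spread work and reduce skew.
df = df.repartition(200, "event_type")

# Cache only DataFrames that are reused; unpersist afterwards to free memory.
df.cache()
df.groupBy("event_type").count().show()
df.unpersist()
```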
Functional Areas: Software/Testing/Networking