Spark & Hadoop Developer - Distributed Systems (6-10 yrs)
Bluebyte Technologies
Key Responsibilities:
- Design, develop, and maintain large-scale distributed data processing systems using Apache Spark and Apache Hadoop ecosystem tools (HDFS, Hive, HBase, Pig, etc.).
- Build data pipelines to ingest, process, and analyze structured and unstructured data from diverse sources.
- Work with data scientists, analysts, and business teams to understand data requirements and translate them into technical solutions.
- Optimize Spark applications and Hadoop jobs for performance, scalability, and reliability.
- Integrate Spark applications with cluster resource managers such as YARN or Mesos.
- Implement data processing solutions in a distributed computing environment, ensuring data consistency and fault tolerance.
- Develop and maintain workflows for scheduling, monitoring, and managing large-scale batch and real-time data processing tasks.
- Design and implement data models, storage structures, and ETL processes to support advanced analytics.
- Troubleshoot and debug data processing issues across distributed systems.
- Participate in code reviews, mentor junior developers, and help establish best practices.
- Stay up to date with the latest developments in the Big Data ecosystem and contribute to the evolution of our data infrastructure.
Required Skills and Experience:
- 6+ years of experience in data engineering, including at least 3 years working with Apache Spark and Hadoop ecosystem technologies.
- Strong proficiency in Java, Scala, and/or Python for building data processing applications.
- Solid experience with Apache Spark for batch and real-time data processing.
- Deep understanding of Hadoop ecosystem components such as HDFS, MapReduce, Hive, Pig, and HBase.
- Expertise in working with distributed systems and large-scale data processing frameworks.
- Strong knowledge of SQL and relational databases, and experience with NoSQL databases (HBase, Cassandra, etc.).
- Familiarity with streaming frameworks like Apache Kafka, Apache Flink, or Spark Streaming for real-time data processing.
- Experience with cloud platforms such as AWS, Azure, or Google Cloud for big data workloads.
- Knowledge of data governance, data quality, and security best practices in big data environments.
- Experience with ETL processes and data integration techniques.
- Strong problem-solving and debugging skills in distributed systems.
- Familiarity with containerization technologies such as Docker and orchestration tools like Kubernetes.
- Excellent communication skills and ability to work in an Agile team environment.