Fulcrum Digital - Site Reliability Engineer - Incident Management (3-6 yrs)
Fulcrum Digital
posted 16hr ago
Flexible timing
About the Role :
We are seeking a highly motivated and experienced Site Reliability Engineer (SRE) with a strong focus on Big Data technologies to join our growing team.
In this role, you will play a critical part in ensuring the availability, performance, and scalability of our mission-critical Big Data platforms.
You will work closely with development teams, data engineers, and other stakeholders to build and maintain a robust and resilient production environment.
Responsibilities :
Production Environment Management : Plan, manage, and oversee all aspects of the production environment for Big Data platforms, including Hadoop, Spark, NiFi, and Impala.
Performance Optimization : Define and implement strategies for Application Performance Monitoring and Optimization within the production environment.
Incident Response & Management :
- Respond effectively to production incidents and system outages.
- Analyze incident root causes and implement proactive measures to prevent future occurrences.
- Track and measure the reduction of incidents over time.
Batch Processing & Scheduling : Ensure the accuracy and timeliness of batch production scheduling and processes.
Data Analysis & Troubleshooting :
- Create and execute queries on Big Data platforms and relational databases to identify and resolve process issues.
- Perform ad-hoc data research, file manipulation/transfer, and investigate process issues as requested by users.
Holistic Problem Solving : Connect the dots across the technology stack during production events to reduce Mean Time To Recover (MTTR).
Service Lifecycle Management :
- Engage in and improve the entire lifecycle of services, from inception and design to deployment, operation, and refinement.
- Analyze ITSM activities and provide feedback to development teams on operational gaps or resiliency concerns.
- Support services before they go live through system design consulting, capacity planning, and launch reviews.
CI/CD & Automation :
- Support the application CI/CD pipeline for promoting software into higher environments.
- Lead DevOps automation and best practices, including pipeline management and software design.
Service Monitoring & Scaling :
- Monitor availability, latency, and overall system health of live services.
- Scale systems sustainably through automation and continuous improvement initiatives.
Collaboration : Work effectively within a global team spread across multiple geographies and time zones.
Knowledge Sharing : Share knowledge and explain processes and procedures effectively to other team members.
Required Skills :
- 3+ years of experience as a Site Reliability Engineer (SRE) with a focus on Big Data technologies.
- Strong experience with Linux operating systems.
- In-depth knowledge of ITSM/ITIL frameworks.
- Proven experience with Big Data technologies such as Hadoop, Spark, NiFi, and Impala.
- 2+ years of experience in running production-grade Big Data systems.
- Solid understanding of SQL or Oracle fundamentals.
- Experience with scripting languages (e.g., Python, Bash) and pipeline management tools.
Desired Skills :
- Experience with industry-standard CI/CD tools (e.g., Git/BitBucket, Jenkins, Maven).
- Experience with cloud platforms (e.g., AWS, Azure, GCP).
- Experience with containerization technologies (e.g., Docker, Kubernetes).
Functional Areas: Software/Testing/Networking