Fulcrum Digital

3.6

based on 134 Reviews

Fulcrum Digital - Site Reliability Engineer - Incident Management (3-6 yrs)

3-6 years

posted 16hr ago

Job Role Insights

Flexible timing

Key skills for the job

Data Analytics, Python, Incident Management, Big Data, Site Reliability Engineering, ITIL

+ 2 more

Job Description

About the Role :

We are seeking a highly motivated and experienced Site Reliability Engineer (SRE) with a strong focus on Big Data technologies to join our growing team.

In this role, you will play a critical part in ensuring the availability, performance, and scalability of our mission-critical Big Data platforms.

You will work closely with development teams, data engineers, and other stakeholders to build and maintain a robust and resilient production environment.

Responsibilities :

Production Environment Management : Plan, manage, and oversee all aspects of the production environment for Big Data platforms, including Hadoop, Spark, NiFi, and Impala.

Performance Optimization : Define and implement strategies for Application Performance Monitoring and Optimization within the production environment.

Incident Response & Management :

- Respond effectively to production incidents and system outages.

- Analyze incident root causes and implement proactive measures to prevent future occurrences.

- Track and measure the reduction of incidents over time.

Batch Processing & Scheduling : Ensure the accuracy and timeliness of batch production scheduling and processes.

Data Analysis & Troubleshooting :

- Create and execute queries on Big Data platforms and relational databases to identify and resolve process issues.

- Perform ad-hoc data research, file manipulation/transfer, and investigate process issues as requested by users.

Holistic Problem Solving : Take a holistic approach to problem-solving, connecting the dots across the technology stack during production events to optimize Mean Time To Recover (MTTR).
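
For readers unfamiliar with the metric, MTTR is commonly computed as the average time between an incident being detected and service being restored. The sketch below is only an illustration under that assumption; the function name `mean_time_to_recover` and the sample incident timestamps are hypothetical and do not come from the employer's tooling.

```python
from datetime import datetime, timedelta


def mean_time_to_recover(incidents):
    """Average of (resolved - detected) over a list of (detected, resolved) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    # Sum the individual recovery durations, starting from a zero timedelta.
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incident records: (detected, resolved)
incidents = [
    (datetime(2025, 1, 6, 9, 0), datetime(2025, 1, 6, 9, 45)),       # 45 minutes
    (datetime(2025, 1, 14, 22, 10), datetime(2025, 1, 14, 23, 40)),  # 90 minutes
    (datetime(2025, 1, 27, 3, 30), datetime(2025, 1, 27, 3, 45)),    # 15 minutes
]

print(mean_time_to_recover(incidents))  # 0:50:00
```

Tracking this value per week or per month is one simple way to see whether recovery times are improving.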

Service Lifecycle Management :

- Engage in and improve the entire lifecycle of services, from inception and design to deployment, operation, and refinement.

- Analyze ITSM activities and provide feedback to development teams on operational gaps or resiliency concerns.

- Support services before they go live through system design consulting, capacity planning, and launch reviews.

CI/CD & Automation :

- Support the application CI/CD pipeline for promoting software into higher environments.

- Lead in DevOps automation and best practices, including pipeline management and software design.

Service Monitoring & Scaling :

- Monitor availability, latency, and overall system health of live services.

- Scale systems sustainably through automation and continuous improvement initiatives.

Collaboration : Work effectively within a global team spread across multiple geographies and time zones.

Knowledge Sharing : Share knowledge and explain processes and procedures effectively to other team members.

Required Skills :

- 3+ years of experience as a Site Reliability Engineer (SRE) with a focus on Big Data technologies.

- Strong experience with Linux operating systems.

- In-depth knowledge of ITSM/ITIL frameworks.

- Proven experience with Big Data technologies such as Hadoop, Spark, NiFi, and Impala.

- 2+ years of experience in running production-grade Big Data systems.

- Solid understanding of SQL or Oracle fundamentals.

- Experience with scripting languages (e.g., Python, Bash) and pipeline management tools.

Desired Skills :

- Experience with industry-standard CI/CD tools (e.g., Git/Bitbucket, Jenkins, Maven).

- Experience with cloud platforms (e.g., AWS, Azure, GCP).

- Experience with containerization technologies (e.g., Docker, Kubernetes).