Coders Brain

3.3

based on 39 Reviews

Site Reliability Engineer - Incident Management (18-24 yrs)

Coders Brain Technology Private Limited

18-24 years

posted 3d ago

Job Role Insights

Flexible timing

Key skills for the job

DevOps, Incident Management, Site Reliability Engineering, ITIL, Monitoring Tools, Performance Tuning

Job Description

Key Responsibilities:

Leadership & Strategy:

- Provide technical and people leadership to SRE, DevOps, Monitoring, and Database Operations teams.

- Collaborate with leadership on budgeting, planning, hiring, and managing third-party contracts.

- Oversee project status, assemble project teams, and define assignments with schedules and milestones.

Platform Reliability & Performance:

- Drive continuous improvement of reliability, stability, and performance of digital platforms.

- Oversee implementation of automated telemetry, observability, and applied intelligence systems.

- Lead efforts to develop automated alerting, self-healing mechanisms, and intelligent response systems.

Incident & Escalation Management:

- Ensure 24/7 uptime of sites and services, with minimal unplanned downtime.

- Serve as Escalation Manager/Critical Incident Manager during major incidents, leading teams in rapid service restoration.

- Provide on-call escalation support based on 24/7/365 schedules.

- Communicate timely updates and incident reports to senior leadership.

Collaboration & Integration:

- Partner with administrators, platform engineers, and other stakeholders to achieve highly reliable infrastructure, systems, and integrations.

- Collaborate with product, application development, QA, and technology teams to enhance service reliability and performance.

Incident Management & Automation:

- Provide advanced Incident and Problem Management support to effectively diagnose, remediate, and resolve platform issues.

- Automate critical workflows across the platform to minimize manual errors and reduce human intervention.

- Implement ITIL processes like Incident, Problem, and Change Management.

Monitoring & Scalability:

- Design and implement effective monitoring systems with proper alerting and escalation mechanisms for critical events.

- Ensure timely capacity planning and infrastructure upgrades for optimal reliability.

- Develop and refine processes to minimize Mean Time to Recover (MTTR) and extend Mean Time to Failure (MTTF).
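
For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how MTTR and MTTF might be computed from an incident log. The incident timestamps, the observation start, and the function names `mttr` and `mttf` are all hypothetical and purely illustrative; real teams typically pull these figures from their incident management tooling, and definitions vary slightly between organizations.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (failure start, service restored). Illustrative data only.
incidents = [
    (datetime(2025, 1, 1, 10, 0), datetime(2025, 1, 1, 10, 30)),
    (datetime(2025, 1, 3, 10, 30), datetime(2025, 1, 3, 11, 30)),
    (datetime(2025, 1, 6, 11, 30), datetime(2025, 1, 6, 12, 0)),
]


def mttr(incidents):
    """Mean Time to Recover: average duration from failure to restoration."""
    total_downtime = sum((end - start for start, end in incidents), timedelta())
    return total_downtime / len(incidents)


def mttf(incidents, observation_start):
    """Mean Time to Failure: average uptime preceding each failure.

    Assumes the service was up at observation_start and incidents do not overlap.
    """
    total_uptime = timedelta()
    last_up = observation_start
    for start, end in sorted(incidents):
        total_uptime += start - last_up  # uptime since the previous recovery
        last_up = end
    return total_uptime / len(incidents)


print("MTTR:", mttr(incidents))
print("MTTF:", mttf(incidents, datetime(2025, 1, 1)))
```

Under these assumptions, shortening each outage lowers MTTR, while lengthening the gaps between outages raises MTTF.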

Documentation & Compliance:

- Create and maintain detailed documentation, including run books, incident response guides, post-mortem reports, RCAs, and mitigation plans.

- Ensure all changes adhere to established procedures and documentation standards.

Business Alignment:

- Understand business workflows and map technology solutions to address problems effectively.

- Lead conversations and provide technical support to both internal and external customers.