Coders Brain
18-24 years
Site Reliability Engineer - Incident Management (18-24 yrs)
posted 3d ago
Flexible timing
Key Responsibilities:
Leadership & Strategy:
- Provide technical and people leadership to SRE, DevOps, Monitoring, and Database Operations teams.
- Collaborate with leadership on budgeting, planning, hiring, and managing third-party contracts.
- Oversee project status, assemble project teams, and define assignments with schedules and milestones.
Platform Reliability & Performance:
- Drive continuous improvement of reliability, stability, and performance of digital platforms.
- Oversee implementation of automated telemetry, observability, and applied intelligence systems.
- Lead efforts to develop automated alerting, self-healing mechanisms, and intelligent response systems.
Incident & Escalation Management:
- Ensure 24/7 uptime of sites and services, with minimal unplanned downtime.
- Serve as Escalation Manager/Critical Incident Manager during major incidents, leading teams in rapid service restoration.
- Provide on-call escalation support based on 24/7/365 schedules.
- Communicate timely updates and incident reports to senior leadership.
Collaboration & Integration:
- Partner with administrators, platform engineers, and other stakeholders to achieve highly reliable infrastructure, systems, and integrations.
- Collaborate with product, application development, QA, and technology teams to enhance service reliability and performance.
Incident Management & Automation:
- Provide advanced Incident and Problem Management support to effectively diagnose, remediate, and resolve platform issues.
- Automate critical workflows across the platform to minimize manual errors and reduce human intervention.
- Implement ITIL processes such as Incident, Problem, and Change Management.
Monitoring & Scalability:
- Design and implement effective monitoring systems with proper alerting and escalation mechanisms for critical events.
- Ensure timely capacity planning and infrastructure upgrades for optimal reliability.
- Develop and refine processes to minimize Mean Time to Recover (MTTR) and extend Mean Time to Failure (MTTF).
Documentation & Compliance:
- Create and maintain detailed documentation, including runbooks, incident response guides, post-mortem reports, RCAs, and mitigation plans.
- Ensure all changes adhere to established procedures and documentation standards.
Business Alignment:
- Understand business workflows and map technology solutions to address problems effectively.
- Lead conversations and provide technical support to both internal and external customers.
Functional Areas: Software/Testing/Networking