The Site Reliability Engineering (SRE) team is responsible for the reliability, scalability, stability and performance of systems and services.
They work with cross-functional teams to design, build and maintain systems, and they troubleshoot issues as they arise. They bridge the gap between development and operations teams.
They work closely with business teams to define Service Level Objectives (SLOs) and Service Level Agreements (SLAs) for critical systems. They also monitor and maintain the uptime of these systems in line with the defined SLOs and SLAs.
They deploy and manage monitoring tools to gain insights on system health and performance.
They analyze performance, identify bottlenecks and implement solutions to improve a system's scalability and latency.
They develop scripts and implement tools and automation frameworks to reduce manual intervention in deployment, monitoring and scaling.
They work with development teams to design and develop observability practices such as logging, metrics and tracing. They aim to diagnose and troubleshoot issues proactively.
They create actionable alerts on monitoring systems to ensure a rapid response to potential production incidents.
They forecast resource needs and provision adequately for current and future demand.
They design and execute chaos experiments to test systems' resilience to failure.
They own, define and implement the Disaster Recovery (DR) processes for systems.
They also conduct planned and unplanned mock DR drills to test for response preparedness during production incidents.
They ensure that security best practices are followed during the design and operation of systems.
They also own and maintain documentation of processes, playbooks, and systems.
They publish KPI reports and other system health updates on a regular basis to the business.
Requirements:
Must-have - Bachelor's degree, preferably in CS or a related field, or equivalent experience
Must-have - 12+ years of overall IT experience
Must-have - 7+ years of proven work experience as a Senior Site Reliability Engineer or in a similar position
Must-have - 5+ years of AWS Cloud experience, with an AWS certification such as AWS Certified DevOps Engineer, SysOps or Security
Must-have - AWS experience - 3+ years of experience using a broad range of AWS technologies (e.g. EC2, RDS, ELB, S3, VPC, CloudWatch and monitoring tools) to develop and maintain an AWS-based cloud solution, with an emphasis on best-practice cloud security
Must-have - 2+ years of experience with CDN and/or cache systems such as Fastly, Akamai or CloudFront