MGR Site Reliability Engineering - India
Sagent M&C
posted 22d ago
Flexible timing
About the Opportunity:
We are looking for a dynamic and experienced Site Reliability Engineering Manager with a strong background in cloud infrastructure (GCP/Azure), monitoring and observability stacks (such as Datadog and Dynatrace), and team leadership. You will play a key role in ensuring the reliability, scalability, and performance of our systems while managing a high-performing team of SREs. You will collaborate closely with development, operations, and security teams to ensure our platform meets the highest standards of availability, performance, and security.
Your Day-to-Day at Sagent:
Leadership & Team Management: Lead, mentor, and grow a team of Site Reliability Engineers (SREs) focused on building and maintaining highly available, scalable, and reliable systems. Foster a culture of continuous improvement, automation, and operational excellence across the team. Conduct regular one-on-ones, performance reviews, and career development for team members. Manage and prioritize the team’s work while aligning efforts with overall business goals and objectives.
Infrastructure & Cloud Management: Own the architecture and operational health of systems running on Google Cloud Platform (GCP) and/or Azure. Ensure systems are designed and operated to meet Service Level Objectives (SLOs) and Service Level Agreements (SLAs), as measured by Service Level Indicators (SLIs). Implement and manage infrastructure-as-code (IaC) practices to ensure repeatability and consistency. Drive cloud cost optimization and resource utilization best practices.
Monitoring & Observability: Lead the implementation, configuration, and maintenance of monitoring and observability tools such as Datadog, Dynatrace, Prometheus, and Grafana. Develop and maintain proactive monitoring, alerting, and automated remediation strategies for business-critical services. Define and implement metrics to measure system performance, reliability, and availability. Build and improve dashboards to provide stakeholders with real-time insights into system health and performance.
Incident Management & Root Cause Analysis: Lead and coordinate incident response efforts, ensuring quick recovery and the completion of post-incident reviews (PIRs) that identify root causes and preventive actions. Drive a culture of blameless post-mortems to learn from incidents and prevent recurrence.
Collaboration & Stakeholder Management: Work closely with engineering, product, and operations teams to identify and address reliability challenges and improvement opportunities. Serve as a technical advisor to leadership on cloud infrastructure, observability, and reliability best practices.
Automation & Efficiency: Champion automation across all aspects of infrastructure, deployment, monitoring, and incident response. Implement tools and processes that increase the efficiency of operations, reduce toil, and improve system uptime.
We'd love to hear from you if you have:
Experience: 12+ years of experience in Site Reliability Engineering, DevOps, or related roles with a focus on cloud infrastructure. 4+ years of experience managing and leading teams in a high-growth, fast-paced environment. Extensive hands-on experience with Google Cloud Platform (GCP) or Microsoft Azure. Expertise in monitoring and observability stacks (e.g., Datadog, Dynatrace, Prometheus, Grafana). Strong experience in infrastructure automation tools (e.g., Terraform, Ansible, CloudFormation). Deep understanding of SRE concepts, including SLOs, SLIs, and SLAs.
Skills: Proficiency in one or more programming/scripting languages (e.g., Python, Go, Shell). Experience with CI/CD pipelines and infrastructure as code (IaC). Knowledge of containerization and orchestration tools (e.g., Docker, Kubernetes). Strong understanding of system architecture, networking, and security in cloud environments.
Leadership & Communication: Proven track record of managing and scaling high-performing teams. Strong communication and interpersonal skills, with the ability to articulate complex technical concepts to both technical and non-technical stakeholders. Ability to influence and drive change at the organizational level. Strong problem-solving skills and a proactive approach to addressing challenges.
Preferred Qualifications: Experience in the FinTech or financial services industry. Certifications in cloud platforms (GCP, Azure) or relevant tools. Knowledge of security best practices for cloud-native systems. Familiarity with agile methodologies and project management tools (e.g., Jira, Confluence).
Employment Type: Full Time, Permanent
2-6 Yrs
₹ 5 - 9L/yr
Chennai