Madhees
Director - Site Reliability Engineering (10-12 yrs)
posted 19d ago
Fixed timing
Director of SRE for a partner company working in product development. We are looking for someone who has worked on end-to-end SRE with DevOps.
Qualifications:
- Bachelor's or Master's degree in Computer Science, Data Engineering, AI/ML, or a related field.
- 10+ years of experience in software release management, with at least 3-5 years in SRE or DevOps environments, preferably in AI or data-driven applications.
- Proven experience building and managing both release management and SRE teams in complex, multi-product environments.
- Strong knowledge of AI/ML operations (MLOps), data pipeline management, and cloud-based AI product deployments.
- Expertise in release management tools (Jenkins, GitLab, Git, Jira) and SRE tools such as Prometheus, Grafana, Datadog, or similar monitoring systems.
- Experience with cloud platforms (AWS, GCP, Azure), containerization (Kubernetes, Docker), and infrastructure automation tools (Terraform, Ansible).
- Excellent problem-solving, organizational, and leadership skills, with a strong track record of driving continuous improvement in both release and operational reliability processes.
Preferred Qualifications:
- Experience deploying and maintaining large-scale AI/ML models in production environments, including monitoring, retraining, and operationalization.
- Familiarity with ITIL, MLOps, or DevOps frameworks and best practices.
- Knowledge of cloud-based services and tools specifically designed for AI/ML (e.g., AWS SageMaker, TensorFlow, PyTorch).
- Demonstrated ability to manage incident response and root cause analysis in complex software ecosystems.
Responsibilities:
- Build, mentor, and lead a high-performing SRE and release management team.
- Foster a culture of ownership, collaboration, and continuous improvement.
- Define team goals, performance metrics, and career development plans.
- Develop and implement SRE best practices, including monitoring, alerting, capacity planning, and incident response.
- Ensure the reliability, availability, and performance of our production systems.
- Drive the adoption of automation and infrastructure-as-code principles.
- Establish and maintain service level objectives (SLOs) and service level agreements (SLAs).
- Oversee the end-to-end release management process, ensuring smooth and efficient deployments.
- Implement and maintain CI/CD pipelines using tools like Jenkins, GitLab, and Git.
- Promote DevOps principles and practices across the organization.
- Manage and optimize data pipelines and MLOps workflows.
- Manage and optimize cloud infrastructure on platforms like AWS, GCP, or Azure.
- Implement and manage containerization and orchestration using Kubernetes and Docker.
- Utilize infrastructure automation tools like Terraform and Ansible to ensure consistent and scalable deployments.
- Oversee the monitoring and management of large-scale AI/ML models in production.
- Lead incident response and root cause analysis efforts.
- Implement proactive monitoring and alerting systems using tools like Prometheus, Grafana, and Datadog.
- Develop and maintain incident response playbooks and procedures.
- Improve system resilience and minimize downtime.
- Collaborate with development, product, and data science teams to ensure alignment on reliability and release goals.
- Communicate effectively with stakeholders at all levels of the organization.
- Document processes, procedures, and best practices.
Functional Areas: Software/Testing/Networking