
Pylon Management Consulting

3.5

based on 41 Reviews

Principal Site Reliability Engineer - Kubernetes/Docker (9-14 yrs)

9-14 years

posted 3d ago

Job Role Insights

Fixed timing

Key skills for the job

Translation, Cloud Services, Kubernetes, Incident Management, Site Reliability Engineering, Docker

+ 4 more

Job Description

About the Role :

We are seeking a highly experienced and visionary Principal Site Reliability Engineer (SRE) to lead our efforts in ensuring the reliability, scalability, and performance of our critical systems. In this role, you will be a technical leader, driving the adoption of SRE principles and practices across the organization. You will be responsible for designing and implementing robust infrastructure, automation, and monitoring solutions to maintain high availability and optimize system performance.

Responsibilities :

SRE Leadership & Strategy :

- Develop and implement SRE strategies and best practices to improve system reliability and performance.

- Lead the design and implementation of highly available and scalable infrastructure solutions.

- Define and enforce service level objectives (SLOs), service level indicators (SLIs), and service level agreements (SLAs).

- Champion a culture of observability, automation, and continuous improvement.

Infrastructure Design & Automation :

- Design and implement infrastructure-as-code (IaC) using tools like Terraform, CloudFormation, or Ansible.

- Architect and manage container orchestration platforms (Kubernetes, Docker Swarm).

- Build and maintain CI/CD pipelines for automated deployments.

- Implement and manage configuration management systems.

Monitoring & Observability :

- Design and implement comprehensive monitoring and logging solutions using tools like Prometheus, Grafana, ELK stack, or Datadog.

- Develop and maintain alerting and incident response procedures.

- Analyze metrics and logs to identify performance bottlenecks and potential issues.

- Implement distributed tracing to understand system behavior.

Incident Management & Response :

- Lead incident response efforts, ensuring timely resolution of critical issues.

- Conduct post-incident reviews to identify root causes and implement preventive measures.

- Develop and maintain runbooks and playbooks for incident response.

- Drive improvements in incident management processes.

Performance Optimization & Capacity Planning :

- Identify and resolve performance bottlenecks through profiling, tracing, and optimization.

- Conduct capacity planning and forecasting to ensure system scalability.

- Optimize resource utilization and reduce operational costs.

Security & Compliance :

- Implement and maintain security best practices across the infrastructure.

- Ensure compliance with relevant industry standards and regulations.

- Conduct security audits and vulnerability assessments.

Mentoring & Knowledge Sharing :

- Mentor and guide junior SREs, fostering a culture of learning and growth.

- Share knowledge and best practices through documentation, presentations, and training sessions.

- Act as a technical leader and subject matter expert.

Technical Skills :

Cloud Platforms :

- Deep expertise in at least one major cloud platform (AWS, Azure, GCP).

- Experience with cloud-native technologies and services.

Containerization & Orchestration :

- Expert-level knowledge of Docker and Kubernetes.

- Experience with container registry services.

Infrastructure as Code (IaC) :

- Proficiency in Terraform, CloudFormation, or Ansible.

CI/CD Tools :

- Experience with Jenkins, GitLab CI, CircleCI, or similar tools.

Monitoring & Logging :

- Expertise in Prometheus, Grafana, ELK stack, Datadog, or similar tools.

Scripting & Automation :

- Strong scripting skills in Python, Bash, or Go.

Operating Systems :

- Expert-level knowledge of Linux system administration.

Networking :

- Deep understanding of networking concepts and protocols (TCP/IP, DNS, HTTP, etc.).

Security :

- Strong understanding of security best practices and tools.

Databases :

- Experience with relational and NoSQL databases.

Distributed Systems :

- Understanding of distributed system principles and architectures.

Qualifications :

- Experience : 9-14 years in Site Reliability Engineering or a related field.

- Education : Bachelor's degree in Computer Science, Software Engineering, or a related field.

- Certifications : Cloud certifications (e.g., AWS Certified DevOps Engineer - Professional, Google Cloud Professional Cloud DevOps Engineer) are highly desirable.

Soft Skills :

- Exceptional problem-solving and analytical skills.

- Strong communication and interpersonal skills.

- Excellent leadership and mentoring abilities.

- Ability to work effectively in a fast-paced environment.

- Strong sense of ownership and accountability.

- Ability to think strategically and drive innovation.

Benefits :

- Competitive salary and benefits package.

- Opportunity to work on cutting-edge technologies and challenging problems.

- Collaborative and supportive work environment.

- Opportunities for professional development and growth.

- Chance to make a significant impact on the reliability and performance of critical systems.