IT Firm

Site Reliability Engineer - Docker/Kubernetes (5-8 yrs)

IT Firm

5-8 years

posted 1mon ago

Job Role Insights

Key skills for the job

DevOps Python Golang Kubernetes Site Reliability Engineering Terraform

Job Description

We are looking for an experienced SRE Engineer to manage production systems and optimize system reliability, scalability, and performance.

Key Responsibilities :

- Provide production support and troubleshoot real-time issues.

- Develop and maintain CI/CD pipelines using Jenkins and Git/Bitbucket.

- Manage deployments with Docker and Kubernetes.

- Set up observability tools (Grafana, Prometheus, Instana).

- Automate infrastructure using Terraform and follow SRE practices.

Required Skills :

- Production Support, Docker, Kubernetes

- CI/CD (Jenkins, Git/Bitbucket)

- Observability (Grafana, Prometheus)

- Terraform, TypeScript, Python

- SRE principles

Responsibilities :

System Reliability & Availability : Ensure that the services are highly available, reliable, and scalable in both production and non-production environments.

Incident Management : Lead the investigation and resolution of incidents, identify root causes, and ensure that services are recovered. Contribute to postmortems and implement preventative measures.

Monitoring & Observability : Build and maintain monitoring and alerting systems. Implement metrics, logs, and tracing to ensure transparency into system health and performance.

Automation : Develop and maintain automation tools and systems to reduce manual intervention and improve operational efficiency.

Capacity Planning : Work with the team to forecast capacity needs and implement scaling solutions to ensure our systems are always prepared for increased load.

Performance Optimization : Identify and eliminate bottlenecks and optimize the performance of critical systems.

Collaboration : Work closely with development, QA, and operations teams to ensure smooth deployment and transition of code to production environments.

Security & Compliance : Ensure that security best practices are followed across our infrastructure. Assist with vulnerability management and compliance tasks.

Disaster Recovery : Design and implement disaster recovery and backup strategies to ensure business continuity.

Required Qualifications :

- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent experience.

- 3+ years of experience in Site Reliability Engineering, DevOps, or similar roles.

- Strong experience with cloud platforms (AWS, GCP, Azure).

- Proficiency in infrastructure automation tools (e.g., Terraform, Ansible, Puppet, Chef).

- Expertise in containerization and orchestration tools (Docker, Kubernetes).

- Experience with CI/CD tools and pipelines (e.g., Jenkins, GitLab, CircleCI).

- Knowledge of monitoring and observability tools (e.g., Prometheus, Grafana, ELK stack, Datadog, New Relic).

- Solid understanding of Linux/Unix systems and networking fundamentals.

- Programming/scripting skills in at least one language (e.g., Python, Go, Bash, Ruby, or Java).

- Strong troubleshooting skills and the ability to debug complex, distributed systems.

- Excellent communication and collaboration skills.

Preferred Qualifications :

- Experience with infrastructure as code (IaC) and configuration management tools.

- Familiarity with microservices architecture.

- Experience with performance tuning and optimization in a large-scale production environment.

- Knowledge of security practices and tools related to cloud infrastructure.

- Understanding of Service-Level Objectives (SLOs) and Service-Level Indicators (SLIs).

What We Offer :