Whitefield Careers

4.0

based on 1 Review

Site Reliability Engineer (5-7 yrs)

5-7 years

posted 17hr ago

Job Role Insights

Key skills for the job

DevOps Python AWS Cloud Computing Golang Kubernetes

Job Description

Overview:

The Site Reliability Engineer (SRE) plays a vital role in bridging the gap between development and operations, utilizing a software engineering mindset to automate and enhance the reliability, scalability, and performance of the organization's infrastructure and applications.

As a key contributor, the SRE ensures that services are available, performant, and rapidly evolving while shifting operational workload to automated systems.

This role is essential for driving the adoption of best practices in Agile and DevOps methodologies by implementing reliable systems and processes that improve service resilience. The successful candidate will work closely with development teams to design robust systems, manage incident responses, and create extensive documentation for improved knowledge transfer.

They are also responsible for monitoring infrastructure and application health, collaborating across teams, and leading the charge toward a culture of reliability.

Key Responsibilities:

- Design, implement, and maintain scalable and reliable systems.

- Develop monitoring frameworks and automated alerts for performance issues.

- Collaborate closely with development teams to ensure smooth deployments.

- Respond to incidents, and troubleshoot and resolve service disruptions effectively.

- Conduct post-mortems and implement improvements to prevent future incidents.

- Develop and maintain automation tools for operational tasks.

- Optimize system performance through performance testing and tuning.

- Manage cloud infrastructure and resources efficiently.

- Create and maintain documentation for system architecture and processes.

- Enforce security best practices across all systems and applications.

- Evaluate and recommend new technologies and processes.

- Implement effective backup and disaster recovery strategies.

- Educate and mentor teams on SRE best practices and methodologies.

- Participate in on-call rotations for incident management.

- Contribute to the development and improvement of CI/CD pipelines.

Required Qualifications:

- Bachelor's degree in Computer Science, Engineering, or related field.

- 5+ years of experience in a Site Reliability Engineering role or similar.

- Proficient in at least one programming language (e.g., Python, Go, Ruby).

- Experience with cloud platforms (AWS, Azure, Google Cloud).

- Strong understanding of containerization technologies (Docker, Kubernetes).

- Knowledge of monitoring tools (Prometheus, Grafana, Datadog).

- Solid grasp of networking concepts and protocols.

- Hands-on experience with configuration management tools (Ansible, Puppet, Chef).

- Familiarity with version control systems (Git, SVN).

- Experience in Agile software development practices.

- Strong problem-solving skills and analytical thinking.

- Excellent communication and collaboration skills.

- Ability to work independently and as part of a team.

- Understanding of IT security fundamentals.

- Ability to learn and adapt quickly in a fast-paced environment.