About the job:
The Red Hat IT OpenShift team is looking for a Site Reliability Engineer (SRE) based in India (Pune or Bangalore) to join our team. In this role, you will develop, scale, and operate our Red Hat OpenShift Managed Cloud platform. Red Hat OpenShift is our enterprise Kubernetes distribution. As an SRE, you will contribute to running Red Hat OpenShift at scale by enabling customer self-service, making our monitoring system more sustainable, and eliminating toil through automation.
In the IT OpenShift team, you will have the opportunity to tackle the complex challenges of scale that are unique to Red Hat IT managed cloud platform services, while using your skills in coding, operations, and large-scale distributed system design. We develop, deploy, and maintain Red Hat's next-generation application deployment environment for mission-critical custom applications and services across a range of hybrid cloud infrastructures. We are a global team operating on-premises and in the public cloud, using the latest technologies from Red Hat and beyond.
Red Hat relies on teamwork and openness for its success. We strive to cultivate a transparent environment that makes room for different voices, and we learn from our failures in a blameless environment to support the continuous improvement of the team. At Red Hat, your individual contributions have more visibility than at most large companies, and visibility means career opportunities and growth. Successful applicants must reside in a state where Red Hat is registered to do business.
What will you do:
- Apply software engineering principles to the operations domain
- Contribute to a service's codebase, write automation that aids in the management of a service, and perform operational engineering work to support a service's Service Level Objectives (SLOs)
- Ensure service reliability meets users' needs, including for internally critical and externally visible services
- Use software and systems engineering to design, build, and run large-scale, distributed, fault-tolerant systems
- Focus on iterative improvement through toil reduction and error-budget enforcement
- Interface with cloud IaaS and SaaS providers and with internal stakeholders, including Support, IT, and Product Engineering, to achieve desired outcomes
- Participate in an on-call rotation within a geographically distributed team to provide 24x7x365 production support, with responsibility to respond to urgent customer issues
- Practice sustainable incident response and blameless postmortems
- Work within a small agile team to develop and improve SRE methodologies, support your peers, plan and self-improve
- Provide feedback around bugs and feature improvements to the various Red Hat Product Engineering teams
What will you bring:
- Bachelor's degree in computer science or a related technical field involving software or systems engineering, or practical experience demonstrating interest in SRE
- 2+ years of experience using cloud providers and technologies (Google, Azure, Amazon, OpenStack, etc.)
- 1+ years of experience administering a Kubernetes-based production environment
- 2+ years of experience programming with at least one object-oriented language; Golang or Python is a big plus
- Ability to collaboratively troubleshoot and solve problems in a team setting
- Basic understanding of UNIX or Linux operating systems
The following will be considered a plus:
- Demonstrated comfort with collaboration, open communication, and reaching across functional boundaries
- Passion for understanding users' needs and delivering outstanding user experiences
Additional Skills:
- Demonstrated ability to quickly and accurately troubleshoot system issues
- Solid understanding of standard TCP/IP networking and common protocols like DNS and HTTP
- 2+ years of experience managing Linux servers running Red Hat Enterprise Linux (RHEL), CentOS, or Fedora hosted at a cloud provider such as Amazon Web Services (AWS), Google Compute Engine (GCE), or Microsoft Azure
- 1+ years of experience with enterprise systems monitoring
- 2+ years of experience with enterprise configuration management software like Red Hat Ansible Automation Platform (AAP)
- Experience with static code analysis tools
- Some experience with code deployment across cloud-based environments
- Some experience with continuous integration and continuous deployment approaches
- Some experience working with complex distributed systems
- Demonstrated ability to debug and optimize code and to automate routine tasks
- Problem-solving skills and the ability to work with minimal supervision as part of a global team
- Experience working with agile development methodologies
Employment Type: Full Time, Permanent