About the Job:
Site Reliability Engineering (SRE) combines software and systems engineering to build and run large-scale, distributed, fault-tolerant systems. SRE ensures that services (both our internally critical and our externally visible systems) have reliability and uptime appropriate to users' needs and a fast rate of improvement, while keeping an eye on capacity and performance. As SREs are responsible for the big picture of how our systems relate to each other, we use a breadth of tools and approaches to solve a broad spectrum of problems. Practices such as limiting time spent on operational work, blameless postmortems, and proactive identification of potential outages factor into the iterative improvement that is key to both product quality and interesting, dynamic day-to-day work. This role is an opportunity to join a high-performing agile team building tools, platforms, and services that will allow Red Hat to continue to expand its customer base and service portfolio.
What will you do:
- Implement and improve the whole lifecycle of services, from inception and design, through monitoring, metrics, and deployment (on-premise and cloud-based), to operation and refinement.
- Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews.
- Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
- Collaborate with internal and external teams on release management activities, including developing automated deployment and testing tools.
- Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
- Provide best-effort off-hours support.
- Work as part of an Agile team to proactively communicate status and complete deliverables on schedule.
- Propose and implement continuous improvement activities.
- Work on the standardization and documentation of common DevOps procedures.
- Participate in the development of new features and bug fixes for Red Hat software services.
- Practice sustainable incident response and blameless postmortems, and drive ticket resolution for our key applications and platforms.
What will you bring:
- Experience in Linux or UNIX systems administration, supporting critical production systems in an enterprise environment
- Experience with OpenShift/Kubernetes or another container orchestration platform, with knowledge of Docker and containers
- Knowledge of configuration management tools such as Ansible and Chef
- Experience with common scripting and automation languages like Python, Ruby, and Bash
- Experience with code deployments across on-premise and cloud environments such as AWS
- Experience designing and deploying highly scalable and resilient applications and platforms
- Java, Golang or Ruby development experience is a plus
- Experience with GitLab CI/CD pipelines or GitHub Actions for automation is a plus
- Red Hat Certified Engineer (RHCE) is a plus
- Experience with content delivery networks like Akamai is a plus
- Ability to multitask, along with excellent communication and team collaboration skills
- Experience with agile project methodologies such as Scrum or Kanban
Employment Type: Full Time, Permanent