Platform Reliability & Availability: Collaborate with the Engineering Director and Principal Engineer to define the technical direction for our infrastructure, ensuring that it scales cost-effectively to support our growth.
Infrastructure Management: Utilize tools like Terraform, GitHub Actions, and scripting languages to manage and optimize our infrastructure and CI/CD systems.
Cloud Infrastructure Expertise: Become an expert in our technology stack, including AWS (RDS, ECS, EC2, S3, Lambda), Cloudflare, Redis, DNS, Docker, and the rest of the infrastructure platform.
Observability & Incident Response: Use observability tools such as Datadog and AWS CloudWatch to monitor platform health, troubleshoot performance issues, and identify underlying causes. Participate in the on-call rotation to respond to incidents.
Disaster Recovery & Incident Management: Ensure disaster recovery and incident response plans are regularly exercised and improved, using industry practices like gamedays and chaos engineering.
Developer Experience: Own and improve the developer experience by refining the development, testing, and continuous deployment processes to make it safer, faster, and easier for engineers to work.
CI/CD Leadership: Be an expert in CI/CD principles, empowering engineers to deliver high-quality services to production continuously.
Mentorship & Collaboration: Support software engineers by pairing, mentoring, and demonstrating effective engineering practices. Facilitate understanding of the production deployment process and performance debugging.
DevOps Practice Leadership: Define and manage platform engineering decisions, ensuring all engineers on the on-call rota are well-prepared for incident response.
Requirements
Experience architecting and supporting cloud-native web application infrastructure, ideally using AWS services like RDS, ECS, EC2, S3, and Lambda.
Hands-on experience with containers and schedulers (e.g., Amazon ECS) and expertise with infrastructure-as-code tools such as Terraform.
Strong understanding of Linux, networking, and security.
Experience supporting database administration and performance, with a focus on scalability and maintainability.
Passion for automating processes and improving the developer experience.
Experience working in a DevOps environment, closely collaborating with software engineers.
Proficiency in version control (e.g., Git) and the ability to use it effectively to structure and communicate your work.
Good to Have
Programming experience in Ruby, JavaScript, or Go.
Experience managing relationships with third-party suppliers, such as AWS and Cloudflare.
Familiarity with gamedays, chaos engineering, and other industry practices to enhance platform resilience.
Experience with disaster recovery and business continuity planning.