We're seeking a dedicated Site Reliability Engineer to join our team. In this role, you will be responsible for maintaining the reliability, scalability, and performance of our systems. You'll implement best practices for monitoring, incident response, and automation to ensure seamless operations. Your expertise will help us build resilient infrastructure, reduce downtime, and enhance the overall user experience.
Key Responsibilities
Experience working with various monitoring tools (e.g., ELK, Dynatrace, CloudWatch, Cloud Logging, Cloud Monitoring, BMC Surveyor, BMC Patrol, Grafana, Prometheus).
Ensure monitoring and self-healing strategies are implemented and maintained to proactively prevent production incidents.
Perform root cause analysis of production issues
Design and manage on-call and escalation processes
Nice to Have
Participate in design reviews and production reviews for new features, products, or pieces of infrastructure
Design and implement ELK (Elasticsearch, Logstash, and Kibana) stack, Prometheus, and Grafana solutions for monitoring and alerting.
Debug production issues across services and levels of the stack.
Establish KPIs to demonstrate maturity, efficiency, and value to our business partners
Work as an integral part of the DevOps team, with complementary skills and common goals
L3 Support experience is an asset.
Help create a release management process and assist with out-of-hours deployments and support (on rotation with team members)
Comfortable working with agile development techniques.