Site Reliability Engineering Lead - DevOps (7-9 yrs)
Trantor
posted 14hr ago
Flexible timing
Key skills for the job
Must-Have Qualifications:
- Bachelor's degree in Computer Science, Engineering, or a related field.
- 7 years of experience in an SRE or similar operations-focused role.
- Strong experience with AWS cloud services, particularly in managing production environments with high availability requirements.
- Proficiency in infrastructure as code (IaC) using Terraform, CloudFormation, or CDK.
- Expertise in monitoring and observability tools like Prometheus, Grafana, CloudWatch, Datadog, or similar.
- Proficiency in scripting languages such as Python, Shell, or Go for automation and custom tooling.
- Hands-on experience with CI/CD pipelines and deployment automation tools (e.g., Jenkins, GitLab CI/CD, AWS CodePipeline).
- In-depth understanding of incident management processes, including troubleshooting, root cause analysis, and on-call protocols.
Good to Have:
- Knowledge of containerization and orchestration (Docker, ECS, Kubernetes) in production environments.
- Familiarity with logging tools and observability practices (e.g., ELK Stack, Fluentd).
- Experience with capacity planning, error budgeting, and balancing reliability with feature delivery.
Roles and Responsibilities:
- Monitor, improve, and maintain high availability and performance for cloud-based infrastructure, ensuring that the environment meets its SLAs and stays within its error budgets.
- Serve as the first responder for system incidents, providing on-call support, performing root cause analysis, and implementing preventative measures.
- Set up and continuously improve monitoring, logging, and alerting for all critical services, ensuring that teams are promptly notified of any issues.
- Build and maintain scripts to automate routine tasks, such as scaling, deployment rollbacks, and disaster recovery.
- Regularly evaluate system capacity and performance, planning for scaling and redundancy based on traffic patterns and usage.
- Create and maintain detailed documentation for troubleshooting, recovery processes, and runbooks to ensure consistency in incident handling.
Must have:
- Site reliability engineer
- AWS
- IaC
- Shell scripting
- DevOps
Functional Areas: Software/Testing/Networking