Site Reliability Engineer - Docker/Kubernetes (5-8 yrs)
IT Firm
We are looking for an experienced Site Reliability Engineer (SRE) to manage production systems and improve their reliability, scalability, and performance.
Key Responsibilities :
- Provide production support and troubleshoot real-time issues.
- Develop and maintain CI/CD pipelines using Jenkins and Git/Bitbucket.
- Manage deployments with Docker and Kubernetes.
- Set up observability tools (Grafana, Prometheus, Instana).
- Automate infrastructure using Terraform and follow SRE practices.
Required Skills :
- Production Support, Docker, Kubernetes
- CI/CD (Jenkins, Git/Bitbucket)
- Observability (Grafana, Prometheus)
- Terraform, TypeScript, Python
- SRE principles
Responsibilities :
System Reliability & Availability : Ensure that the services are highly available, reliable, and scalable in both production and non-production environments.
Incident Management : Lead the investigation and resolution of incidents, identify root causes, and restore services. Contribute to postmortems and implement preventive measures.
Monitoring & Observability : Build and maintain monitoring and alerting systems. Implement metrics, logs, and tracing to provide visibility into system health and performance.
Automation : Develop and maintain automation tools and systems to reduce manual intervention and improve operational efficiency.
Capacity Planning : Work with the team to forecast capacity needs and implement scaling solutions to ensure our systems are always prepared for increased load.
Performance Optimization : Identify and eliminate bottlenecks and optimize the performance of critical systems.
Collaboration : Work closely with development, QA, and operations teams to ensure smooth deployment and transition of code to production environments.
Security & Compliance : Ensure that security best practices are followed across our infrastructure. Assist with vulnerability management and compliance tasks.
Disaster Recovery : Design and implement disaster recovery and backup strategies to ensure business continuity.
Required Qualifications :
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent experience.
- 3+ years of experience in Site Reliability Engineering, DevOps, or similar roles.
- Strong experience with cloud platforms (AWS, GCP, Azure).
- Proficiency in infrastructure automation tools (e.g., Terraform, Ansible, Puppet, Chef).
- Expertise in containerization and orchestration tools (Docker, Kubernetes).
- Experience with CI/CD tools and pipelines (e.g., Jenkins, GitLab, CircleCI).
- Knowledge of monitoring and observability tools (e.g., Prometheus, Grafana, ELK stack, Datadog, New Relic).
- Solid understanding of Linux/Unix systems and networking fundamentals.
- Programming/scripting skills in at least one language (e.g., Python, Go, Bash, Ruby, or Java).
- Strong troubleshooting skills and the ability to debug complex, distributed systems.
- Excellent communication and collaboration skills.
Preferred Qualifications :
- Experience with infrastructure as code (IaC) and configuration management tools.
- Familiarity with microservices architecture.
- Experience with performance tuning and optimization in a large-scale production environment.
- Knowledge of security practices and tools related to cloud infrastructure.
- Understanding of Service-Level Objectives (SLOs) and Service-Level Indicators (SLIs).
What We Offer :
Functional Areas: Software/Testing/Networking