Site Reliability Engineer - Cloud Infrastructure (4-7 yrs)
HireXtra
posted 8d ago
Flexible timing
Responsibilities:
- Design, implement, and maintain scalable, highly available, and fault-tolerant cloud infrastructure.
- Monitor system health and performance metrics, proactively identifying and resolving issues.
- Conduct regular system audits and security reviews to ensure compliance with best practices.
- Implement automated solutions to prevent and mitigate potential failures.
- Rapidly respond to incidents, troubleshoot issues, and restore service.
- Analyze incident root causes and implement preventive measures to avoid recurrence.
- Develop and maintain incident response plans and playbooks.
- Automate routine operational tasks and infrastructure provisioning using IaC tools (Terraform, CloudFormation).
- Develop and maintain scripts and tools to improve efficiency and reduce manual effort.
- Implement robust monitoring solutions using tools like Prometheus, Grafana, Splunk, and CloudWatch.
- Define and maintain key performance indicators (KPIs) and service-level objectives (SLOs).
- Analyze monitoring data to identify trends and potential issues.
- Conduct controlled experiments to identify weaknesses and improve system resilience.
- Implement chaos engineering practices to proactively test system boundaries.
- Collaborate with development, operations, and security teams to ensure smooth delivery of services.
- Share knowledge and best practices with team members.
Required Skills and Experience:
- Strong hands-on experience with major cloud platforms (AWS, GCP) and their services (S3, Lambda, Kubernetes).
- Deep understanding of cloud networking concepts and protocols.
- Proficiency in automation tools like Terraform, CloudFormation, and Ansible.
- Expertise in monitoring tools like Prometheus, Grafana, Splunk, and CloudWatch.
- Experience with incident response and problem-solving techniques.
- Familiarity with chaos engineering principles and tools.
- Strong scripting skills (Python, Bash, etc.).
- Excellent communication and problem-solving skills.
- Certification in relevant cloud technologies (AWS, GCP, etc.).
- Experience with containers and container orchestration (Docker, Kubernetes).
- Knowledge of security best practices and compliance standards.
Functional Areas: Software/Testing/Networking