Agivant Technologies
Site Reliability Engineer - Cloud Platforms (7-12 yrs)
posted 6d ago
Flexible timing
Job Description:
We are looking for a highly skilled Site Reliability Engineer (SRE) with strong engineering and architectural expertise to design, implement, and manage large-scale, mission-critical infrastructure across multiple data centers and cloud providers.
As an SRE, you will be responsible for architecting and optimizing our global infrastructure, enabling development teams to roll out new features efficiently while maintaining high availability and reliability. You will be hands-on with automation, performance tuning, infrastructure scalability, and cloud-native technologies to ensure a seamless user experience for millions of customers.
Key Responsibilities:
1. Architect and implement highly scalable, fault-tolerant, and distributed systems across multi-cloud (OCI, AWS, GCP) and on-premise environments using modern DevOps and SRE principles.
2. Design and deploy next-generation cloud infrastructure with a strong focus on automation, self-healing systems, and performance optimization.
3. Develop and maintain infrastructure-as-code (IaC) using Terraform and configuration management tools such as Ansible and Puppet for automated provisioning and orchestration.
4. Build and optimize containerized environments using Kubernetes and Docker for seamless deployment and scaling.
5. Drive performance, scalability, and security improvements across our cloud and on-prem infrastructure, ensuring high availability and disaster recovery capabilities.
6. Monitor, troubleshoot, and resolve complex system issues by implementing advanced observability solutions, logging, and real-time monitoring frameworks.
7. Develop and enforce SRE best practices, including SLI/SLO definition, capacity planning, and incident management strategies.
8. Eliminate toil and automate repetitive tasks using scripting languages such as Python, Golang, or Shell scripting to improve operational efficiency.
9. Collaborate closely with engineering, architecture, and security teams to improve system resiliency, optimize application performance, and streamline CI/CD workflows.
10. Lead the transition of legacy systems to modern, cloud-native architectures, advocating for DevOps and infrastructure automation.
11. Participate in 24/7 on-call rotations, ensuring rapid response to critical incidents and driving post-mortem analysis for continuous improvement.
Requirements:
1. 7+ years of hands-on experience in a Site Reliability Engineering (SRE) role, with a strong focus on designing, implementing, and managing cloud-native infrastructure.
2. Proficiency with any cloud platform (preferably OCI), including actual design and implementation expertise, not just operational experience.
3. Proven experience in building, deploying, and optimizing infrastructure-as-code (IaC) using Terraform.
4. Strong automation mindset with proficiency in Ansible, Puppet, or other configuration management tools.
5. Hands-on experience with container orchestration using Kubernetes, Docker, and microservices architecture.
6. Advanced scripting and automation skills in Python, Golang, or Shell scripting to eliminate manual operations.
7. Working knowledge of load balancing technologies (HAProxy, Nginx, F5, Varnish, dnsdist) and web servers (Apache, Nginx).
8. Strong understanding of networking, distributed systems, and observability tools (Prometheus, Grafana, ELK stack, Datadog).
9. Experience in designing and implementing highly available, scalable, and secure architectures across cloud and hybrid environments.
10. AWS and/or GCP certifications are a plus but not required.
11. This is not a support-focused role: we are looking for engineers who have built, deployed, and optimized complex distributed systems from the ground up.
Functional Areas: Software/Testing/Networking