Senior Site Reliability Engineer - Incident Management (5-12 yrs)
CirrusLabs
posted 11hr ago
Flexible timing
Job Description :
We are CirrusLabs :
Our vision is to become the world's most sought-after niche digital transformation company that helps customers realize value through innovation. Our mission is to co-create success with our customers, partners and community. Our goal is to enable employees to dream, grow and make things happen. We are committed to excellence. We are a dependable partner organization that delivers on commitments. We strive to maintain integrity with our employees and customers. Every action we take is driven by value. Our well-knit teams and employees are the core of who we are. You are the core of a values-driven organization.
You have an entrepreneurial spirit. You enjoy working as part of well-knit teams. You value the team over the individual. You welcome diversity at work and within the greater community. You aren't afraid to take risks. You appreciate a growth path, charted with your leadership team, that maps how you can grow inside and outside of the organization. You thrive on the continuing education programs that your company sponsors to strengthen your skills and help you become a thought leader ahead of the industry curve.
You are excited about creating change because your skills can help the greater good of every customer, industry and community. We are hiring a talented Senior Site Reliability Engineer (SRE) to join our team. If you're excited to be part of a winning team, CirrusLabs is a great place to grow your career.
Experience : 5 - 8 years
Location : Bengaluru
Senior Site Reliability Engineer (SRE) - Mission-Critical SaaS Cloud Products
Key Responsibilities :
Reliability and Performance Management :
- Design, implement, and maintain highly available, scalable, and resilient cloud-native architectures for mission-critical SaaS products.
- Develop and implement SLOs, SLIs, and SLAs to measure and improve service reliability.
- Continuously optimize system performance and resource utilization across multiple cloud platforms.
- Lead incident response efforts, effectively troubleshooting complex issues to minimize downtime and impact.
- Reduce Mean Time to Recover (MTTR) through proactive monitoring, automated alerting, and efficient problem-solving techniques.
- Conduct thorough Root Cause Analysis (RCA) for all major incidents and implement preventive measures.
Observability and Monitoring :
- Design and implement end-to-end observability solutions across our distributed systems.
- Develop and maintain comprehensive monitoring strategies using tools like ELK Stack, Prometheus, Grafana.
Automation and Infrastructure as Code (IaC) :
- Implement Infrastructure as Code practices using tools like Terraform.
- Develop and maintain automated deployment pipelines and CI/CD workflows.
- Create self-healing systems and automate routine operational tasks to reduce manual intervention.
Cloud-Agnostic Architecture :
- Design and implement cloud-agnostic solutions that can operate efficiently across multiple cloud providers.
- Develop expertise in event-driven architectures and related technologies (e.g., Apache Kafka/Azure Event Hubs, Redis, MongoDB Atlas, Azure IoT Hub).
- Implement and manage containerized applications using Kubernetes across different cloud environments.
Continuous Improvement :
- Regularly review and refine operational practices to enhance efficiency and reliability.
- Stay updated with the latest industry trends and technologies in SRE, cloud computing, and DevOps.
- Contribute to the development of internal tools and frameworks to support SRE practices.
Requirements :
- Strong knowledge of cloud platforms (Azure) and their associated services.
- Expertise in observability tools (ELK Stack, Dynatrace, Prometheus).
- Expertise in containerization technologies such as Docker and Kubernetes.
- Understanding of event-driven architecture and database technologies (MongoDB Atlas, Azure SQL, PostgreSQL).
- Proficiency in IaC tools such as Terraform and GitHub Actions.
- Proficiency in one or more programming languages: Python, .NET, or Java.
- Strong understanding of networking concepts, load balancing, and security practices.
Functional Areas: Software/Testing/Networking