Groww - Senior Site Reliability Engineer - System Administration (4-7 yrs)
Posted 4 months ago
Flexible timing
About Groww :
- We are a passionate group of people focused on making financial services accessible to every Indian through a multi-product platform.
- Each day, we help millions of customers take charge of their financial journey.
- Customer obsession is in our DNA.
- Every product, every design, every algorithm down to the tiniest detail is executed keeping the customers' needs and convenience in mind.
- Our people are our greatest strength.
- Everyone at Groww is driven by ownership, customer-centricity, integrity and the passion to constantly challenge the status quo.
- Are you as passionate about defying conventions and creating something extraordinary as we are? Let's chat.
Expertise and Qualifications :
- We are seeking a highly motivated and experienced Senior Site Reliability Engineer to join our engineering team.
- As an SRE, you will be responsible for ensuring the reliability, availability, scalability, and performance of our applications and infrastructure.
- You will collaborate closely with software developers, platform engineers, and other team members to design, provision, build, and maintain systems that are scalable, secure, and highly available.
Responsibilities :
- Monitor and troubleshoot issues related to system performance, reliability, and security.
- Define and implement Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets to measure and improve service reliability.
- Analyze and report on metrics and trace data using Grafana and Prometheus.
- Participate in an on-call rotation to provide 24/7 support for critical production systems.
- Evaluate and automate manual and repetitive tasks to reduce toil and improve system efficiency.
- Design and manage infrastructure using tools like Terraform or Crossplane, including Crossplane Composite Resource Definitions (XRDs) on Kubernetes.
- Implement and manage security measures to protect infrastructure and data.
- Coordinate between developers and operations to ensure smooth software releases and timely resolution of production issues.
- Conduct thorough root cause analysis (RCA) of production incidents and implement preventive measures.
- Review and optimize system performance, identify bottlenecks, and implement capacity planning and recovery strategies.
- Maintain comprehensive documentation of systems, processes, and incident responses.
- Continuously seek and implement improvements to infrastructure, processes, and tools to enhance system reliability and performance.
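As an illustration of the SLO and error-budget responsibility listed above, the arithmetic can be sketched in a few lines of Python. This is a minimal sketch, not Groww's tooling: the function names, the 30-day window, and the event-count inputs are all assumptions made for the example.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent).

    The SLI here is assumed to be good_events / total_events,
    for example successful requests over all requests.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_failures = (1.0 - slo_target) * total_events
    actual_failures = total_events - good_events
    return 1.0 - actual_failures / allowed_failures
```

Under these assumptions, a 99.9% availability target over 30 days allows about 43.2 minutes of downtime, and a service that has failed 500 of 1,000,000 requests against that target has used roughly half of its budget.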
Requirements :
- 4-7 years of relevant work experience.
- Bachelor's or Master's degree in Computer Science or a related field.
- Strong understanding of Linux/Unix systems administration and networking, with strong troubleshooting skills.
- Must have experience with Kubernetes, Docker, and other containerization technologies.
- Experience with cloud platforms such as GCP, AWS, or Azure is required.
- Strong programming skills in one or more languages such as Go, Python, or Java.
- Experience with monitoring and alerting tools such as Grafana, Prometheus, PagerDuty, or similar technologies is desirable.
- Must have experience with infrastructure provisioning tools such as Terraform, Pulumi, CloudFormation, or similar technologies.
- Strong interpersonal and team collaboration skills.
Functional Areas: Software/Testing/Networking