Collaborate with our Engineering team to design, implement and maintain optimal solutions on the cloud
Define, automate, and document deployment strategies
Maintain and improve monitoring and alerting systems to preempt outages and take corrective actions
Document and test the multi-region disaster recovery strategy
Own SLIs and SLOs for microservices, monitoring and maintaining them within defined limits
Plan and execute timely infrastructure upgrades and penetration testing to ensure security and compliance
Required skills & qualifications
10+ years of experience in Site Reliability Engineering
Expert in programming and scripting with Python, Go, Node.js, and Bash
Experienced in building CI/CD pipelines, in deployment strategies like Blue-Green, Rolling, and Canary, in deployment version management, and in tools like Jira, Git, and Jenkins
Solid understanding of containers and container orchestration (Docker, Kubernetes), the Istio service mesh, and configuration management
An obsession with metrics, reliability, and uptime, and fluency in tools like Datadog or New Relic
Deep understanding of cloud cost optimization strategies to maintain cost efficiency without sacrificing performance
Expert in infrastructure-as-code tools like Terraform and OpenTofu
Sound knowledge of networking and AWS infrastructure components like S3, EKS, Elasticsearch, Aurora, EFS, ElastiCache, CodeDeploy, Route 53, ELB, etc.
Experience managing relational databases like MySQL and Postgres, as well as NoSQL databases
Experience with compliance, security, DevSecOps, production incident roster management and escalation paths, and publishing RCAs
Experience in several of the following areas: database architecture, ETL, business intelligence, AI, machine learning, advanced analytics
Strong sense of ownership and integrity demonstrated through clear communication and collaboration