Ascendion
Site Reliability Engineer - Azure Cloud (6-9 yrs)
posted 8d ago
Flexible timing
Job Description :
We are looking for an experienced Azure Site Reliability Engineer (SRE) with 6-9 years of experience to support and administer Azure Kubernetes Service (AKS) clusters running critical middleware handling thousands of transactions per second (TPS).
The ideal candidate will have a strong background in Infrastructure as Code (IaC), cloud networking, automation, and observability to ensure high availability, scalability, and reliability.
This role requires an engineering-first mindset, focusing on IaC-driven deployments, automation, monitoring, and operational excellence while maintaining a 99.999% availability target.
Key Responsibilities :
- Deploy, manage, and maintain Azure Kubernetes Service (AKS) clusters with a focus on scalability and availability.
- Handle cluster cutovers, base image updates, and IaC-driven changes.
- Apply SRE principles to ensure high availability and resiliency.
- Write and maintain Terraform scripts for IaC deployments.
- Manage Kubernetes configurations using Helm charts.
- Automate infrastructure provisioning, scaling, and disaster recovery processes.
- Implement GitOps methodologies using ArgoCD for deployment automation.
- Implement and maintain monitoring & logging solutions using ELK Stack (Elasticsearch, Logstash, Kibana) or Grafana Loki.
- Analyze logs and metrics to proactively detect issues.
- Integrate OpenTelemetry (preferred) for distributed tracing and observability.
- Ensure compliance with security best practices for handling sensitive data in regulated environments.
- Implement and manage secrets using HashiCorp Vault (preferred).
- Maintain secure cloud networking within Azure.
- Build and optimize CI/CD pipelines using GitHub Actions (preferred) or any CI/CD tool.
- Automate deployments using GitOps principles with ArgoCD.
- Optimize build, release, and rollback processes for high availability.
- Conduct disaster recovery testing and build fault-tolerant systems.
- Respond to production incidents, troubleshoot issues, and implement long-term fixes.
- Participate in an on-call rotation to ensure system reliability.
Required Skills & Qualifications :
Cloud :
- Azure (Must-have) with strong networking expertise.
- Terraform (Hands-on experience required).
Container Orchestration :
- Kubernetes (AKS) with Helm.
- GitHub Actions (preferred) or any CI/CD tool.
- ArgoCD for deployment automation.
- Proficiency in Python or Golang (any one required).
- Experience with ELK Stack or Grafana Loki.
- Linux with networking skills.
- OpenTelemetry (preferred).
- HashiCorp Vault (preferred).
- Familiarity with security and compliance best practices.
- Bachelor's degree in Computer Science, Engineering, or related fields (or equivalent experience).
- 6-9 years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering.
- Minimum 2 years of hands-on experience as an SRE working with Azure, Kubernetes, and Terraform.
- Experience working in highly available, regulated, and security-sensitive environments.
Why Join Us :
- Work with highly scalable, mission-critical systems handling thousands of transactions per second.
- Be part of a fast-paced, DevOps-driven engineering culture.
- Leverage the latest cloud-native technologies and automation frameworks.
- Competitive compensation and growth opportunities.
Functional Areas: Software/Testing/Networking