Xebia
5-14 years
Xebia - Site Reliability Engineer - DevOps (5-14 yrs)
posted 12hr ago
Flexible timing
Hybrid
Shift : 3 PM - 12 AM
As an SRE, you will work at the intersection of software engineering and system operations to ensure our cloud infrastructure is reliable, efficient, and scalable.
You will be responsible for monitoring, troubleshooting, and automating our systems, with a special focus on AI/ML services deployed on Google Cloud Platform (GCP).
This role is perfect for individuals who are passionate about improving service reliability through automation, observability, and a data-driven approach.
Key Responsibilities:
- Ensure the availability, performance, and scalability of services running on Google Cloud Platform (GCP), particularly for AI/ML services.
- Monitor and optimize cloud-based systems, reducing downtime through proactive monitoring and automation.
- Develop and implement automation scripts for infrastructure provisioning, configuration, and deployment.
- Design and manage monitoring and alerting systems using tools such as Prometheus, Grafana, and Stackdriver to track key performance indicators and reliability targets (SLIs, SLOs, SLAs).
- Collaborate with engineering teams to ensure applications are built for high reliability and scalability in a cloud environment, particularly those leveraging AI/ML services on GCP.
- Troubleshoot complex production issues and drive improvements to the infrastructure and application design to enhance reliability and performance.
- Work on incident response and root cause analysis (RCA), identifying areas for improvement and implementing solutions to prevent recurrence.
- Participate in on-call rotations, providing support for production systems and resolving critical incidents quickly and effectively.
- Implement infrastructure-as-code (IaC) practices using tools like Terraform, Ansible, or Google Cloud Deployment Manager.
- Optimize cost and resource usage in GCP, ensuring services are running efficiently and within budget.
Required Skills and Qualifications:
- Experience working as a Site Reliability Engineer (SRE), DevOps engineer, or in a similar role with a focus on cloud infrastructure.
- Strong hands-on experience with Google Cloud Platform (GCP) services (e.g., Compute Engine, Kubernetes Engine, BigQuery, Cloud Storage).
- Familiarity with AI/ML services and frameworks on GCP, such as AI Platform, TensorFlow, or BigQuery ML.
- Proficient in scripting and automation using languages such as Python, Go, or Bash.
- Solid understanding of monitoring, logging, and alerting tools (e.g., Prometheus, Grafana, Google Stackdriver, or similar).
- Experience with containerization technologies like Docker and Kubernetes.
- Strong knowledge of systems architecture, networking, and distributed systems.
- Experience with infrastructure as code (IaC) tools like Terraform, CloudFormation, or Ansible.
- Ability to troubleshoot complex systems and effectively manage incident response.
- Strong analytical and problem-solving skills with a focus on continuous improvement.
- Excellent communication skills and the ability to collaborate with cross-functional teams.
Preferred Qualifications:
- Google Cloud Platform (GCP) certifications, such as Professional Cloud Architect or Professional Cloud DevOps Engineer.
- Experience with CI/CD pipelines and tools like Jenkins, GitLab CI, or CircleCI.
- Familiarity with AI/ML frameworks and cloud services for machine learning.
- Knowledge of security best practices for cloud environments.
- Experience working in an agile or DevOps-driven environment.
Functional Areas: Software/Testing/Networking