InfoService

4.0

based on 266 Reviews

Site Reliability Engineer - ELK Stack (5-7 yrs)

5-7 years

Posted 1 month ago

Job Role Insights

Flexible timing

Key skills for the job

Python, Kubernetes, VMware, Site Reliability Engineering, Monitoring Tools, Bash Scripting

Job Description

Role : Site Reliability Engineer (SRE) - Observability and Telemetry.

Job Summary :

We are seeking a highly skilled Site Reliability Engineer (SRE) - Observability and Telemetry to join our dynamic and innovative team.

The ideal candidate will have a deep understanding of observability principles, infrastructure monitoring, and performance optimization in virtualized and containerized environments.

This role will focus on designing, building, and maintaining observability platforms to ensure the reliability, scalability, and performance of our systems.

Key Responsibilities :

- Design and Implement Observability Solutions : Develop and maintain scalable observability systems, ensuring robust telemetry, logging, and monitoring across cloud-native and hybrid infrastructures.

- Monitoring and Alerting : Create effective monitoring strategies using tools such as Prometheus, Grafana, and ELK Stack to detect anomalies and ensure system health.

- Performance Optimization : Develop and implement performance dashboards and reports to track system metrics, resource utilization, and application behavior.

- Telemetry Integration : Drive adoption and implementation of OpenTelemetry to enhance distributed tracing, logging, and metrics collection across microservices and containerized applications.

- Infrastructure Management : Collaborate with infrastructure teams to improve observability for virtualized environments (VMware) and container orchestration platforms (Kubernetes).

- Automation : Develop and enhance automated solutions for incident response, alert management, and system health reporting to reduce manual intervention and improve reliability.

- Capacity Planning and Reliability : Proactively analyze performance trends and system logs to forecast capacity needs and ensure system reliability.

- Collaboration and Documentation : Work closely with development, operations, and infrastructure teams to promote best practices in observability and provide clear documentation and training on tools and processes.

Required Skills and Experience :

Proven Expertise in Observability Tools :

- Hands-on experience with Prometheus, Grafana, ELK Stack (Elasticsearch, Logstash, Kibana), and OpenTelemetry for monitoring, logging, and tracing.

Strong Knowledge of Virtualized and Containerized Environments :

- Experience working with VMware and Kubernetes platforms for managing and monitoring system resources.

Dashboards and Visualization :

- Proven ability to design, build, and optimize management dashboards that visualize critical performance and reliability metrics.

Scripting and Automation :

- Proficiency in scripting languages such as Python, Bash, or Go to automate observability workflows.

Infrastructure as Code :

- Familiarity with tools like Terraform, Ansible, or Helm for automated infrastructure deployment and configuration management.

Strong Analytical and Problem-Solving Skills :

- Ability to analyze complex system behaviors, troubleshoot performance bottlenecks, and implement data-driven optimizations.

Collaboration and Communication :

- Excellent interpersonal skills to work effectively with cross-functional teams and communicate complex technical concepts to diverse stakeholders.

Preferred Qualifications :

- Experience with service mesh architectures and tools like Istio or Linkerd for observability in microservices environments.

- Knowledge of cloud platforms (AWS, Azure, GCP) and their native monitoring solutions.

- Familiarity with security and compliance monitoring frameworks and tools.