W Energy Software

2.7

based on 4 Reviews

W Energy Software - Site Reliability Engineer - Grafana/Prometheus (4-6 yrs)

4-6 years

posted 12d ago

Job Role Insights

Flexible timing

Key skills for the job

Oracle DBA, Incident Management, SQL Server, Site Reliability Engineering, Change Management, Monitoring Tools

Job Description

Site Reliability Engineer (SRE)

Description :

We seek an experienced Site Reliability Engineer (SRE) to ensure our production systems' reliability, scalability, and performance.

This role will use monitoring and metrics tools such as Azure Monitor, Grafana, and Prometheus to identify and resolve performance issues proactively.

You will work closely with engineering teams to maintain high system availability, manage incidents, and optimize production environments for seamless operations.

As an SRE, you'll have the autonomy to make critical decisions while receiving support and guidance as needed.

We value intelligence, creativity, and curiosity, and we're committed to providing opportunities for growth and learning.

Our team fosters a flexible, respectful work environment where collaboration and communication are encouraged at all levels, including with the executive team.

As part of our Infrastructure team, your contributions will be critical to our ongoing success and deeply appreciated.

Key Responsibilities :

Production Reliability :

- Maintain and enhance the reliability and availability of production systems, ensuring fault tolerance and minimal downtime.

- Design and support high-availability configurations, including database clustering and read replication.

Incident Response :

- Respond to and resolve production incidents in real time, leveraging monitoring tools to diagnose and address issues effectively.

Performance Management :

- Use metrics from Azure Monitor, Grafana, and Prometheus to identify and resolve performance bottlenecks across applications, infrastructure, and databases.

- Implement optimizations to improve overall system efficiency based on performance data.

Change Management :

- Plan and execute system changes with a focus on minimizing risk and maintaining operational stability.

Monitoring & Metrics Collection :

- Develop and maintain monitoring systems to collect real-time infrastructure and application metrics.

- Create and refine Grafana dashboards to visualize system health and performance effectively.

Troubleshooting & Root Cause Analysis (RCA) :

- Conduct thorough investigations into production issues, analyzing system metrics and logs to identify root causes.

- Document and implement permanent fixes to prevent issue recurrence.

Collaboration :

- Collaborate with engineering teams to address real-time alerts, performance anomalies, and application behavior issues.

Requirements :

- Strong expertise with monitoring and metrics tools, including Azure Monitor, Grafana, and Prometheus.

- Proficiency in SQL Server administration, including performance tuning, clustering, and read replication.

- Solid experience in real-time monitoring and performance troubleshooting within cloud environments.

- Proficiency in Linux/Unix and Windows system administration.

- Experience with cloud platforms (e.g., Azure, AWS, or GCP) for deploying and scaling production systems.

- Strong scripting/programming skills (e.g., Python, PowerShell, or Bash).

- Knowledge of infrastructure-as-code tools (e.g., Terraform, Ansible).

- Experience administering and managing user accounts in Active Directory (AD).

- Experience integrating authentication systems with Single Sign-On (SSO) solutions (e.g., SAML, OAuth, OpenID Connect).

Experience :

- Bachelor's degree in engineering, computer science, accounting, finance, MIS, or a related field.

- 4-6 years of experience working with cloud systems (e.g., AWS/Azure) and SaaS environments.

- 3+ years of experience in Site Reliability Engineering, DevOps, or similar roles.

- Proven track record of diagnosing and resolving performance issues using monitoring tools like Grafana, Prometheus, or DataDog.

- Strong analytical and problem-solving skills.

- Excellent written and verbal communication skills, with the ability to collaborate effectively across teams.

- Detail-oriented with a proactive approach to maintaining production reliability.

- Ability to manage multiple priorities and tasks effectively.

Preferred Qualifications :

- AWS certification (Solutions Architect or Cloud Practitioner).

- Experience with automation tools such as Jenkins, Ansible, and Terraform.

- Database administration experience.

- Networking experience, including VPNs and routing.

- Security knowledge related to AWS/Azure.

Working Hours :

- This role involves a rotational shift schedule, including night, morning, and regular day shifts.