W Energy Software
4-6 years
W Energy Software - Site Reliability Engineer - Grafana/Prometheus (4-6 yrs)
posted 11d ago
Flexible timing
Key skills for the job
Site Reliability Engineer (SRE)
Description :
We seek an experienced Site Reliability Engineer (SRE) to ensure our production systems' reliability, scalability, and performance.
This role will use monitoring and metrics tools such as Azure Monitor, Grafana, and Prometheus to identify and resolve performance issues proactively.
You will work closely with engineering teams to maintain high system availability, manage incidents, and optimize production environments for seamless operations.
As an SRE, you'll have the autonomy to make critical decisions while receiving support and guidance as needed.
We value intelligence, creativity, and curiosity, and we're committed to providing opportunities for growth and learning.
Our team fosters a flexible, respectful work environment where collaboration and communication are encouraged at all levels, including with the executive team.
As part of our Infrastructure team, your contributions will be critical to our ongoing success and deeply appreciated.
Key Responsibilities :
Production Reliability :
- Maintain and enhance the reliability and availability of production systems, ensuring fault tolerance and minimal downtime.
- Design and support high-availability configurations, including database clustering and read replication.
Incident Response :
- Respond to and resolve production incidents in real time, leveraging monitoring tools to diagnose and address issues effectively.
Performance Management :
- Use metrics from Azure Monitor, Grafana, and Prometheus to identify and resolve performance bottlenecks across applications, infrastructure, and databases.
- Implement optimizations to improve overall system efficiency based on performance data.
Change Management :
- Plan and execute system changes with a focus on minimizing risk and maintaining operational stability.
Monitoring & Metrics Collection :
- Develop and maintain monitoring systems to collect real-time infrastructure and application metrics.
- Create and refine Grafana dashboards to visualize system health and performance effectively.
Troubleshooting & Root Cause Analysis (RCA) :
- Conduct thorough investigations into production issues, analyzing system metrics and logs to identify root causes.
- Document and implement permanent fixes to prevent issue recurrence.
Collaboration :
- Collaborate with engineering teams to address real-time alerts, performance anomalies, and application behavior issues.
Requirements :
- Strong expertise with monitoring and metrics tools, including Azure Monitor, Grafana, and Prometheus.
- Proficiency in SQL Server administration, including performance tuning, clustering, and read replication.
- Solid experience in real-time monitoring and performance troubleshooting within cloud environments.
- Proficiency in Linux/Unix and Windows system administration.
- Experience with cloud platforms (e.g., Azure, AWS, or GCP) for deploying and scaling production systems.
- Strong scripting/programming skills (e.g., Python, PowerShell, or Bash).
- Knowledge of infrastructure-as-code tools (e.g., Terraform, Ansible).
- Experience administering and managing user accounts using Active Directory (AD).
- Experience integrating authentication systems with Single Sign-On (SSO) solutions (e.g., SAML, OAuth, OpenID Connect).
Experience :
- Bachelor's degree in engineering, computer science, accounting, finance, MIS, or a related field.
- 4-6 years of experience working with cloud systems (e.g., AWS/Azure) and SaaS environments.
- 3+ years of experience in Site Reliability Engineering, DevOps, or similar roles.
- Proven track record of diagnosing and resolving performance issues using monitoring tools like Grafana, Prometheus, or Datadog.
- Strong analytical and problem-solving skills.
- Excellent written and verbal communication skills, with the ability to collaborate effectively across teams.
- Detail-oriented with a proactive approach to maintaining production reliability.
- Ability to manage multiple priorities and tasks effectively.
Preferred Qualifications :
- AWS certification (Solutions Architect or Cloud Practitioner).
- Experience with automation tools such as Jenkins, Ansible, and Terraform.
- Database administration experience.
- Networking experience, including VPNs and routing.
- Security knowledge related to AWS/Azure.
Working Hours :
- This role involves a rotational shift schedule, including night, morning, and regular day shifts.
Functional Areas: Software/Testing/Networking