Site Reliability Engineer - Prometheus/Grafana (4-7 yrs)
Exploro Solutions
posted 4d ago
Job Role : Site Reliability Engineer
YOE : 4 to 7 yrs
Key Responsibilities :
Payment Monitoring and Alert Triage :
- Monitor payment-flow alerts across multiple applications in rotating 24x7 shifts and identify issues proactively.
- Triage alerts by analysing trends across the affected dimensions of the payment flow and correlating them with metrics, logs, and traces from other services to find the root cause, and document each triage.
- Ensure timely escalation and closure of reported issues, working with the engineering teams of the payment services.
Observability Development :
- Design and implement alerting frameworks using tools like Datadog, Grafana, Kibana, Splunk, and Prometheus.
- Set up custom dashboards and streamline alerting to reduce noise while ensuring critical issues are addressed.
- Drive the adoption of SLO-based alerting, burn rate metrics, and anomaly detection techniques.
Incident Management :
- Lead incident management efforts from identification to resolution.
- Conduct post-incident reviews and implement preventive measures to avoid recurring issues.
- Maintain detailed documentation and performance reports on incident trends and team efficiency.
Automation and Optimization :
- Automate repetitive processes using programming languages like Python or Java.
- Develop and refine scripts to manage and fine-tune alerts.
- Collaborate with engineering teams to implement solutions that reduce manual effort and operational toil.
Required Skills and Qualifications :
- Proven expertise in SRE observability concepts and monitoring architecture design.
- Extensive experience with alerting frameworks like Prometheus, Grafana, Kibana, Splunk, and Datadog.
- Hands-on experience with alert noise reduction and advanced alerting techniques such as anomaly detection and burn rate alerting.
- Strong proficiency in incident management, including analysis, root cause identification, and preventive measures.
- Familiarity with payment monitoring systems and operational requirements.
- Proficiency in automation tools and scripting languages like Python or Java.
- Excellent collaboration and communication skills to interact with cross-functional teams.
- Flexibility to work in rotational 24x7 shifts from the office.
Notice Period : Immediate to 20 days
Functional Areas: Software/Testing/Networking