InOpTra Digital - Senior Reliability Engineer - Prometheus (10-18 yrs)
InOpTra Digital
posted 2mon ago
Flexible timing
Job Title : Senior Site Reliability Engineer (SRE) - Prometheus
Experience Level : 10+ Years of experience
Location : Seattle/Remote
Employment Type : Full-time
Note : We are looking only for native USA candidates.
Job Description :
We are looking for a skilled Senior Site Reliability Engineer (SRE) with deep expertise in Prometheus, Grafana, and Kubernetes to join our remote team. In this role, you will manage and optimize the infrastructure supporting a large-scale hardware monitoring project, ensuring high availability, reliability, and scalability across thousands of hardware servers.
Key Responsibilities :
- Monitoring and Observability : Design, implement, and maintain comprehensive monitoring systems using Prometheus and Grafana to track and visualize metrics from thousands of hardware servers.
- Kubernetes Orchestration : Deploy, manage, and optimize applications on Kubernetes clusters, ensuring optimal performance and scalability.
- Automation and Scripting : Develop and implement automation for routine tasks, including alerting, system monitoring, and response mechanisms.
- Incident Management : Troubleshoot, diagnose, and resolve infrastructure incidents, ensuring the uptime and reliability of services.
- Performance Tuning : Optimize system performance, ensuring efficient data storage, querying, and alerting in Prometheus and Grafana environments.
- CI/CD Integration : Collaborate with development teams to integrate monitoring into the CI/CD pipeline and ensure smooth deployments.
- Capacity Planning : Perform capacity analysis and ensure that systems are appropriately scaled to handle increasing load.
- Post-Deployment Support : Support the monitoring solution once it is implemented, including troubleshooting incidents.
Skills and Experience :
- Prometheus : Advanced experience in configuring, tuning, and managing Prometheus for large-scale environments.
- Grafana : Proficiency in setting up Grafana dashboards for real-time monitoring and alerting.
- Kubernetes : Strong hands-on experience with managing Kubernetes clusters, deployments, and container orchestration.
- Scripting : Proficiency in scripting languages such as Python or Bash to automate tasks.
- Alerting & Incident Management : Experience setting up advanced alerting and incident management processes.
- Infrastructure as Code (IaC) : Experience with tools like Helm.
- CI/CD Pipelines : Knowledge of CI/CD tools and automation frameworks for seamless deployment.
- Familiarity with external storage backends for Prometheus (e.g., Mimir) for high-scale storage.
- Experience with any cloud platform (e.g., AWS, GCP, Azure) for deploying infrastructure.
- Knowledge of microservices architecture and REST APIs.
- Knowledge of Redfish APIs.
Qualifications :
- 6+ years of hands-on experience as an SRE, DevOps Engineer, or in a similar role managing complex infrastructure systems.
- 2+ years of hands-on experience implementing and configuring Prometheus monitoring.
- Strong understanding of DevOps practices and infrastructure automation.
- Proven experience in large-scale monitoring systems and high-availability environments.
- Excellent troubleshooting, analytical, and problem-solving skills.
Functional Areas: Manufacturing