Site Reliability Engineer - ELK Stack (5-7 yrs)
InfoService
posted 15hr ago
Flexible timing
Role : Site Reliability Engineer (SRE) - Observability and Telemetry.
Job Summary :
We are seeking a highly skilled Site Reliability Engineer (SRE) - Observability and Telemetry to join our dynamic and innovative team.
The ideal candidate will have a deep understanding of observability principles, infrastructure monitoring, and performance optimization in virtualized and containerized environments.
This role will focus on designing, building, and maintaining observability platforms to ensure the reliability, scalability, and performance of our systems.
Key Responsibilities :
- Design and Implement Observability Solutions : Develop and maintain scalable observability systems, ensuring robust telemetry, logging, and monitoring across cloud-native and hybrid infrastructures.
- Monitoring and Alerting : Create effective monitoring strategies using tools such as Prometheus, Grafana, and ELK Stack to detect anomalies and ensure system health.
- Performance Optimization : Develop and implement performance dashboards and reports to track system metrics, resource utilization, and application behavior.
- Telemetry Integration : Drive adoption and implementation of OpenTelemetry to enhance distributed tracing, logging, and metrics collection across microservices and containerized applications.
- Infrastructure Management : Collaborate with infrastructure teams to improve observability for virtualized environments (VMware) and container orchestration platforms (Kubernetes).
- Automation : Develop and enhance automated solutions for incident response, alert management, and system health reporting to reduce manual intervention and improve reliability.
- Capacity Planning and Reliability : Proactively analyze performance trends and system logs to forecast capacity needs and ensure system reliability.
- Collaboration and Documentation : Work closely with development, operations, and infrastructure teams to promote best practices in observability and provide clear documentation and training on tools and processes.
Required Skills and Experience :
Proven Expertise in Observability Tools :
- Hands-on experience with Prometheus, Grafana, ELK Stack (Elasticsearch, Logstash, Kibana), and OpenTelemetry for monitoring, logging, and tracing.
Strong Knowledge of Virtualized and Containerized Environments :
- Experience working with VMware and Kubernetes platforms for managing and monitoring system resources.
Dashboards and Visualization :
- Proven ability to design, build, and optimize management dashboards that visualize critical performance and reliability metrics.
Scripting and Automation :
- Proficiency in scripting languages such as Python, Bash, or Go to automate observability workflows.
Infrastructure as Code :
- Familiarity with tools like Terraform, Ansible, or Helm for automated infrastructure deployment and configuration management.
Strong Analytical and Problem-Solving Skills :
- Ability to analyze complex system behaviors, troubleshoot performance bottlenecks, and implement data-driven optimizations.
Collaboration and Communication :
- Excellent interpersonal skills to work effectively with cross-functional teams and communicate complex technical concepts to diverse stakeholders.
Preferred Qualifications :
- Experience with service mesh architectures and tools like Istio or Linkerd for observability in microservices environments.
- Knowledge of cloud platforms (AWS, Azure, GCP) and their native monitoring solutions.
- Familiarity with security and compliance monitoring frameworks and tools.
Functional Areas: Software/Testing/Networking