Design, deploy, and maintain observability tools and platforms, including monitoring, logging, and tracing systems.
Ensure optimal configuration and performance of observability tools such as Prometheus, Grafana, the ELK stack (Elasticsearch, Logstash, Kibana), Jaeger, and cloud (AWS/GCP/Azure) observability tools.
Monitoring and Alerting:
Develop and manage dashboards and alerts to monitor the health and performance of applications and infrastructure.
Implement robust alerting mechanisms that detect anomalies, outages, and system performance issues and notify the responsible teams in real time.
Logging and Tracing:
Implement centralized logging solutions to aggregate logs from various systems and applications.
Develop and maintain distributed tracing solutions to provide end-to-end visibility into system transactions.
Performance Analysis and Optimization:
Analyze system performance metrics to identify bottlenecks and performance degradation, drawing on an understanding of SLOs and SLIs.
Work with development and operations teams to remediate performance issues and optimize system performance.
Automation and Scripting:
Create automation scripts to streamline observability tasks and processes.
Develop self-healing mechanisms through automated incident response.
Collaboration and Communication:
Work closely with development, operations, and SRE teams to align observability solutions with business and technical requirements.
Provide guidance and training on observability tools and best practices to other team members.
Documentation and Reporting:
Create and maintain detailed documentation for observability systems, processes, and procedures.
Generate periodic reports and dashboards to provide insights into system performance and reliability.
Qualifications and Experience
Education: Bachelor's degree in Computer Science, Information Technology, or a related field. Advanced degree preferred.
Experience:
Minimum of 7 years of experience in IT infrastructure, with at least 3 years in an observability or monitoring role.
Proven experience in observability engineering, including deploying and managing observability solutions.
Experience with monitoring tools (e.g., Prometheus, Grafana), logging tools (e.g., ELK stack), and tracing tools (e.g., Jaeger, OpenTelemetry).
Experience with cloud platforms such as AWS, Azure, or Google Cloud, and with databases such as MySQL.
Technical Skills:
Strong understanding of observability concepts including metrics, logging, and tracing.
Proficiency in scripting languages such as Bash, Python, Perl, or Go.
Familiarity with containerization (e.g., Docker), orchestration tools (e.g., Kubernetes), and CI/CD pipelines.
Understanding of IP networking and monitoring of network devices (e.g., routers, firewalls).
Experience with infrastructure as code tools (e.g., Terraform, Ansible).
Soft Skills:
Excellent problem-solving and analytical skills.
Strong communication and collaboration skills.
Ability to work independently and in a team-oriented environment.
Preferred Qualifications:
Experience with APM tools like New Relic, Datadog, or Dynatrace.
Knowledge of service mesh technologies (e.g., Istio).
Open-source contributions or relevant certifications in observability tools and methodologies.
What is in it for you
You get to build the next leading-edge connected vehicle and Internet of Things platforms
The ability to collaborate with our highly skilled groups who work with cutting-edge technologies
High visibility as you support the systems that drive our public-facing services
Career growth opportunities
Aeris walks the walk on diversity. We're a brilliant mix of varying ethnicities, religions, cultures, sexual orientations, gender identities, ages, and professional/personal/military experiences, and that's by design. Diverse perspectives are essential to our culture, innovative process, and competitive edge. Aeris is proud to be an equal opportunity employer.