Own the monitoring, logging, and alerting tools used by the overall Software Engineering team to ensure we meet our reliability requirements.
Learn and adopt technologies that may aid in solving our challenges.
Support the overall Software Engineering team in monitoring and alerting on any issues they encounter.
Help respond to service issues and determine how to automatically alert the responsible parties with enough context to make the service owner a self-sufficient first responder.
Act as first responder to issues with shared infrastructure and escalate to other team members as necessary.
Work with other teams to put automatic resolutions in place and reduce the need for human response.
Participate in on-call rotations to monitor platform/infrastructure issues.
Minimum Required Qualifications:
2+ years in a reliability or technical support-related role
Proficient with ANSI SQL (reading and writing queries)
Strong problem-solving and analytical skills and the ability to self-manage
Experience with monitoring REST APIs and web services
Experience with high-availability systems
Experience using and configuring observability systems such as Datadog, Grafana, Grafana Loki, Prometheus, and Sumo Logic.
Experience with monitoring relational databases such as MySQL, Aurora/RDS MySQL, and PostgreSQL.
Bonus Requirements:
2+ years of experience with Linux/Unix system administration
Experience with monitoring Hadoop ecosystems (e.g. Hadoop, Hive, Presto)
Experience monitoring and analyzing services/applications in a service-oriented architecture at the network/server level as well as in containerized environments (such as Kubernetes and Docker)