You will leverage Site Reliability Engineering best practices and the ITIL Solutions Architecture framework to devise incident management strategies.
Serving as Incident Commander, change manager, and a senior technical resource responsible for preventing, identifying, triaging, documenting, investigating, mitigating, and recovering from site- and service-impacting incidents across Groupon's ~300+ globally dispersed services.
Facilitating the coordination and resolution of Post Mortems through best practices, and overseeing Problem Management.
Dedicated project time to work on a number of interesting and engaging projects.
Working as part of the Incident Management team (shifts Monday-Friday, with one weekend of primary on-call every 10 weeks).
We're excited about you if you have:
6+ years of experience administering Linux system environments and completing root cause analysis of site-impacting issues.
4+ years of experience creating unique Splunk or Kibana search queries to identify, resolve, and prevent incidents and outages, plus experience owning all impacting events until resolution, including coordinating with Subject Matter Experts, triaging tasks, creating all associated documentation, completing action items, and running Post Mortems.
6+ years of experience with web application operations and root cause analysis.
6+ years of experience developing policies and procedures that improve overall production stability.
Good communication, consulting, and collaboration skills interfacing with senior leadership teams.
Experience with one or more programming languages (Python, Ruby, Java) is a plus.
A plus if you have a BS, MS, or PhD in Computer Science or a related field.
A plus if you have designed and created tools to manage the site and services.