Essential Responsibilities:
- Understand business requirements and collaborate with Product & DevOps teams to implement highly available, scalable, resilient, cost-efficient solutions in Cloud environments.
- Deploy Observability tools (New Relic, Splunk, ELK, and other open-source o11y tools) in our Cloud infrastructure and applications via Terraform, and be the SME for these tools.
- Create and configure alerts, dashboards, and reports that map to the Golden Signals: Latency, Errors, Traffic, and Saturation.
- Pioneer the definitions of SLIs, SLOs, and Error Budgets for GE Aerospace Digital Workplaces products and services, and champion their implementation for large-scale adoption.
- Perform Root Cause Analysis (RCA) for SLO breaches, alerts, and incidents. Lead the troubleshooting and debugging sessions.
- Solve problems relating to critical products, applications, and services, and create solutions (automations, runbooks, etc.) to prevent recurrence.
- Lead the Incident Management and Postmortem processes, and collaborate with the Operations team to develop templates for communications, runbooks, and documents.
- Consistently share best practices for reliability, resiliency, and performance, and improve processes within and across teams.
- Take a data-driven approach to decisions about capacity needs, Cloud cost optimization, and infrastructure stability.
- Prioritize reducing MTTx (Mean Time to Recover/Resolve/Repair) for Production incidents to provide a better user experience.
- Propose new designs and develop solutions to solve complex problems in application resiliency and availability.
- Be a strong technical mentor to junior team members and help them realize their full potential.
Qualifications/Requirements:
- Bachelor's degree from a recognized university or college with a minimum of 4 years of professional experience, OR Diploma with a minimum of 5 years of professional experience, OR Higher Secondary Certificate with a minimum of 7 years of professional experience.
- A minimum of 2 years of experience in Production Engineering or Site Reliability Engineering roles.
- A minimum of 2 years of experience in Cloud environments (e.g., AWS, Azure) is required.
- A minimum of 2 years of experience in the DevOps and Infrastructure domains.
Desired Characteristics:
Technical Expertise:
- Primary role in recent positions must be as an infrastructure engineer, software engineer, or SRE working with Cloud technologies, predominantly Production-facing.
- Expertise in Observability, i.e., configuring monitoring and logging tools (e.g., New Relic, Splunk, CloudWatch, ELK), and proficiency in using them.
- Solid and extensive experience in Cloud environments, specifically AWS or Azure.
- Good programming skills beyond Bash/shell scripting (e.g., Python, Java).
- A prior or current certification on the Azure or AWS platforms is a strong plus.
- Configuration management, Infrastructure as Code (IaC), and CI/CD experience (Jenkins, GitLab, Nexus, etc.).
- Solid understanding of operating systems, especially Linux, and of containerization technologies (e.g., Docker and Kubernetes).
- Ability to work in a DevOps culture and an Agile/Scrum model.
- Influence and create new designs, architectures, standards, and methods for large-scale distributed systems.
- Technical mindset focused on automating everything to reduce manual toil.
Business and Leadership Expertise:
- Prior experience with Digital Workplace services is a huge plus.
- Leads by example and adopts the blameless SRE mindset.
- Able to develop and write modular code to solve complex problems impacting application resiliency and availability.
- Ability to determine how to effectively integrate disparate systems to optimize operational processes.
- Demonstrated strong technical, problem-solving, and analytical skills.
- Solid communication skills at all levels of the organization.
- Able to influence peers and leadership cross-functionally for data-driven solutions, hypotheses, and theories.
- Driven by professional curiosity and a desire to develop a deep understanding of products, applications, services, and their dependencies.
- Proactively identifies and removes project obstacles or barriers on behalf of the team.
Employment Type: Full Time, Permanent