Head - Site Reliability Engineering (5-7 yrs)
Truelancer
Posted 1 month ago
Fixed timing
Key Responsibilities :
Leadership & Strategy :
- Lead and mentor a 20-member SRE team across three PODs, fostering a collaborative and high-performing culture.
- Define and implement the SRE strategy to ensure system reliability, scalability, and performance.
- Collaborate with cross-functional teams, including product management, development, and operations, to align SRE efforts with business objectives.
- Drive the adoption of SRE best practices, such as error budgeting, chaos engineering, and disaster recovery planning.
Observability & Monitoring :
- Implement and manage observability tools such as Splunk, Elasticsearch, Prometheus, and Grafana to monitor system performance and identify issues proactively.
- Establish robust monitoring frameworks to ensure real-time insights into system health and performance.
Reliability Engineering :
- Oversee incident management processes, ensuring SLA compliance and driving reductions in Mean Time to Recovery (MTTR).
- Develop and enforce reliability standards for infrastructure, applications, and services.
- Design and execute disaster recovery plans, ensuring business continuity during unexpected events.
Automation & Optimization :
- Promote automation for workflows, deployments, and incident response to reduce manual effort and minimize errors.
- Optimize existing processes to enhance operational efficiency and system reliability.
Client Interaction :
- Act as the primary technical point of contact for clients, managing daily tasks, risks, and mitigation plans.
- Ensure client satisfaction by maintaining high system uptime and meeting performance commitments.
Mandatory Technical Skills :
Observability Tools : Expertise in Splunk, Elasticsearch, Prometheus, and Grafana.
Cloud Platforms : Advanced knowledge of AWS, Azure, or Google Cloud.
Containerization & Microservices : Proficiency with Kubernetes and Docker.
DevOps Practices : Strong understanding of CI/CD pipelines, infrastructure as code (IaC), and automation frameworks.
Programming : Hands-on experience with Java, Golang, or JavaScript for building and maintaining tools and services.
Additional Skills and Qualifications :
Soft Skills :
- Exceptional leadership and team management capabilities in a multinational organization.
- Excellent communication and interpersonal skills to engage with both technical teams and non-technical stakeholders.
- Strong analytical and problem-solving abilities.
Preferred Qualifications :
- Advanced certifications in SRE, Cloud Platforms (AWS, Azure, Google Cloud), or DevOps methodologies.
- Experience with tools like Terraform, Ansible, or Chef for infrastructure automation.
- Familiarity with practices like canary releases and feature flagging.
Education : Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
Why Join Us?
- Lead a talented team of SRE professionals in shaping the future of GRC technology solutions.
- Work on cutting-edge technologies in a dynamic and collaborative environment.
- Enjoy opportunities for career growth and skill development.
- Be part of a company committed to innovation, excellence, and reliability.
Functional Areas: Software/Testing/Networking