Principal Site Reliability Engineer - Kubernetes/Docker (9-14 yrs)
Pylon Management Consulting
posted 3d ago
Fixed timing
About the Role :
We are seeking a highly experienced and visionary Principal Site Reliability Engineer (SRE) to lead our efforts in ensuring the reliability, scalability, and performance of our critical systems. In this role, you will be a technical leader, driving the adoption of SRE principles and practices across the organization. You will be responsible for designing and implementing robust infrastructure, automation, and monitoring solutions to maintain high availability and optimize system performance.
Responsibilities :
SRE Leadership & Strategy :
- Develop and implement SRE strategies and best practices to improve system reliability and performance.
- Lead the design and implementation of highly available and scalable infrastructure solutions.
- Define and enforce service level objectives (SLOs), service level indicators (SLIs), and service level agreements (SLAs).
- Champion a culture of observability, automation, and continuous improvement.
Infrastructure Design & Automation :
- Design and implement infrastructure-as-code (IaC) using tools like Terraform, CloudFormation, or Ansible.
- Architect and manage container orchestration platforms (Kubernetes, Docker Swarm).
- Build and maintain CI/CD pipelines for automated deployments.
- Implement and manage configuration management systems.
Monitoring & Observability :
- Design and implement comprehensive monitoring and logging solutions using tools like Prometheus, Grafana, ELK stack, or Datadog.
- Develop and maintain alerting and incident response procedures.
- Analyze metrics and logs to identify performance bottlenecks and potential issues.
- Implement distributed tracing to understand system behavior.
Incident Management & Response :
- Lead incident response efforts, ensuring timely resolution of critical issues.
- Conduct post-incident reviews to identify root causes and implement preventive measures.
- Develop and maintain runbooks and playbooks for incident response.
- Drive improvements in incident management processes.
Performance Optimization & Capacity Planning :
- Identify and resolve performance bottlenecks through profiling, tracing, and optimization.
- Conduct capacity planning and forecasting to ensure system scalability.
- Optimize resource utilization and reduce operational costs.
Security & Compliance :
- Implement and maintain security best practices across the infrastructure.
- Ensure compliance with relevant industry standards and regulations.
- Conduct security audits and vulnerability assessments.
Mentoring & Knowledge Sharing :
- Mentor and guide junior SREs, fostering a culture of learning and growth.
- Share knowledge and best practices through documentation, presentations, and training sessions.
- Act as a technical leader and subject matter expert.
Technical Skills :
Cloud Platforms :
- Deep expertise in at least one major cloud platform (AWS, Azure, GCP).
- Experience with cloud-native technologies and services.
Containerization & Orchestration :
- Expert-level knowledge of Docker and Kubernetes.
- Experience with container registry services.
Infrastructure as Code (IaC) :
- Proficiency in Terraform, CloudFormation, or Ansible.
CI/CD Tools :
- Experience with Jenkins, GitLab CI, CircleCI, or similar tools.
Monitoring & Logging :
- Expertise in Prometheus, Grafana, ELK stack, Datadog, or similar tools.
Scripting & Automation :
- Strong scripting skills in Python, Bash, or Go.
Operating Systems :
- Expert-level knowledge of Linux system administration.
Networking :
- Deep understanding of networking concepts and protocols (TCP/IP, DNS, HTTP, etc.).
Security :
- Strong understanding of security best practices and tools.
Databases :
- Experience with relational and NoSQL databases.
Distributed Systems :
- Understanding of distributed system principles and architectures.
Qualifications :
- Experience : 9-14 years of experience in Site Reliability Engineering or a related field.
- Education : Bachelor's degree in Computer Science, Software Engineering, or a related field.
- Certifications : Cloud certifications (e.g., AWS Certified DevOps Engineer - Professional, Google Cloud Professional Cloud DevOps Engineer) are highly desirable.
Soft Skills :
- Exceptional problem-solving and analytical skills.
- Strong communication and interpersonal skills.
- Excellent leadership and mentoring abilities.
- Ability to work effectively in a fast-paced environment.
- Strong sense of ownership and accountability.
- Ability to think strategically and drive innovation.
Benefits :
- Competitive salary and benefits package.
- Opportunity to work on cutting-edge technologies and challenging problems.
- Collaborative and supportive work environment.
- Opportunities for professional development and growth.
- Chance to make a significant impact on the reliability and performance of critical systems.
Functional Areas: Other