We are seeking a Senior DevOps Engineer (SRE) to manage and optimize large-scale, mission-critical production systems. The ideal candidate will have a strong problem-solving mindset, extensive troubleshooting experience, and expertise in scaling, automating, and improving the reliability of production systems. This role requires hands-on proficiency with Kubernetes, Terraform, CI/CD pipelines, and cloud platforms (AWS, GCP, Azure), along with scripting skills in Python or Go. The candidate will drive observability and monitoring initiatives using tools such as Prometheus, Grafana, and APM solutions (Datadog, New Relic, OpenTelemetry).
Strong communication, incident management skills, and a collaborative approach are essential. Experience in team leadership and multi-client engagement is a plus.
Ideal Candidate Profile
4-6 years of solid experience as an SRE or DevOps engineer, with a proven track record of handling large-scale production environments
Bachelor's or Master's degree in Computer Science, Engineering, or a related field
Strong hands-on experience managing large-scale production systems
Strong production troubleshooting skills and the ability to handle high-pressure situations
Strong experience with databases and data systems (PostgreSQL, MongoDB, Elasticsearch, Kafka)
Experience making production systems more scalable, highly available, and fault-tolerant
Hands-on experience with ELK or other logging and observability tools
Hands-on experience with Prometheus, Grafana, and Alertmanager, and with on-call tooling such as PagerDuty
Problem-Solving Mindset
Strong skills in Kubernetes, Terraform, Helm, Argo CD, and AWS/GCP/Azure
Proficient in scripting and automation with Python or Go
Strong fundamentals in DNS, networking, and Linux
Experience with APM tools such as New Relic, Datadog, and OpenTelemetry
Good experience with incident response, incident management, and writing detailed root cause analyses (RCAs)
Experience applying application best practices to make apps more reliable and fault-tolerant
Strong leadership skills and the ability to mentor team members and provide guidance on best practices.
Able to manage multiple clients and take ownership of client issues.
Experience with Git and coding best practices
Good to have
Team-leading Experience
Experience handling multiple clients
Requirements gathering from clients
Good communication skills
Key Responsibilities
Design and Development:
Architect, design, and develop high-quality, scalable, and secure cloud-based software solutions.
Collaborate with product and engineering teams to translate business requirements into technical specifications.
Write clean, maintainable, and efficient code, following best practices and coding standards.
Cloud Infrastructure:
Develop and optimize cloud-native applications, leveraging cloud services such as AWS, Azure, or Google Cloud Platform (GCP).
Implement and manage CI/CD pipelines for automated deployment and testing.
Ensure the security, reliability, and performance of cloud infrastructure.
Technical Leadership:
Mentor and guide junior engineers, providing technical leadership and fostering a collaborative team environment.
Participate in code reviews, ensuring adherence to best practices and high-quality code delivery.
Lead technical discussions and contribute to architectural decisions.
Problem Solving and Troubleshooting:
Identify, diagnose, and resolve complex software and infrastructure issues.
Perform root cause analysis for production incidents and implement preventative measures.
Continuous Improvement:
Stay up-to-date with the latest industry trends, tools, and technologies in cloud computing and software engineering.
Contribute to the continuous improvement of development processes, tools, and methodologies.
Drive innovation by experimenting with new technologies and solutions to enhance the platform.
Collaboration:
Work closely with DevOps, QA, and other teams to ensure smooth integration and delivery of software releases.
Communicate effectively with stakeholders, including technical and non-technical team members.
Client Interaction & Management:
Serve as a direct point of contact for multiple clients.
Handle the unique technical needs and challenges of two or more clients concurrently.
Engage in both direct client interaction and internal team coordination.