Recro.io - Site Reliability Engineer - CI/CD Pipeline (4-6 yrs)
Posted 2 months ago
Flexible timing
Job Description:
We are looking for a talented Site Reliability Engineer (SRE) to join our team and help ensure the reliability, scalability, and performance of our applications and services.
As an SRE, you will play a key role in bridging the gap between development and operations, focusing on automation, infrastructure management, and maintaining system health.
You will be responsible for building and maintaining scalable infrastructure, implementing best practices, and monitoring systems to ensure high availability and performance.
Key Responsibilities:
- Design, develop, and maintain scalable, reliable, and secure infrastructure to support applications and services, ensuring that systems are efficient, fault-tolerant, and optimized for performance.
- Collaborate with development and operations teams to design solutions that meet both business and technical requirements for reliability and scalability.
- Implement Site Reliability Engineering (SRE) best practices to drive operational excellence.
- Focus on high availability, performance optimization, and capacity planning to ensure critical systems run efficiently and scale effectively under demand.
- Help set and track Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets to continuously improve system reliability.
- Collaborate with software engineering, operations, and security teams to improve system reliability, observability, and scalability.
- Work closely with development teams to ensure that systems are designed for maintainability, scalability, and easy troubleshooting.
- Contribute to continuous improvement efforts by providing feedback from a reliability and operations perspective.
- Automate routine operational tasks (e.g., deployment, monitoring, incident response) to reduce manual intervention and improve efficiency across the infrastructure.
- Use tools such as Terraform, Ansible, or similar to automate infrastructure provisioning, scaling, and configuration management.
- Monitor system performance using modern monitoring tools (e.g., Prometheus, Grafana), and implement effective alerting to identify and respond to issues proactively.
- Troubleshoot and resolve incidents to minimize downtime and ensure that services are restored quickly with minimal disruption.
- Participate in the on-call rotation, providing 24/7 support for critical systems when needed.
- Ensure infrastructure and systems meet relevant security, reliability, and compliance standards.
- Apply security best practices to the infrastructure, ensuring that systems are protected against threats and vulnerabilities.
- Regularly review and improve the security posture of systems and applications, implementing necessary patches, upgrades, and controls.
Requirements:
Experience:
- 4-6 years of experience in a Site Reliability Engineer (SRE) or similar role, with a proven track record of maintaining and optimizing large-scale systems.
- Strong experience with cloud platforms, particularly Google Cloud Platform (GCP), as well as other cloud environments such as AWS or Azure.
Technical Skills:
- Expertise in DevOps practices such as CI/CD pipelines, Infrastructure as Code (IaC), and automation tools like Terraform, Ansible, or similar.
- Experience with monitoring and observability tools such as Prometheus, Grafana, the ELK stack, or equivalent systems to track and visualize infrastructure performance, usage, and issues.
- Proficiency in programming and scripting languages like Python, Bash, or others, with experience in writing scripts for automating tasks, deployments, and workflows.
- Familiarity with Git for version control and experience with collaborative workflows in a development environment.
Functional Areas: Software/Testing/Networking