We are a forward-thinking technology company committed to delivering high-performance, scalable, and reliable systems. We are seeking an experienced Site Reliability Engineer (SRE) to join our team and ensure the stability and efficiency of our infrastructure and services.
Key Responsibilities:
System Reliability and Performance:
Design, implement, and maintain highly available and scalable systems.
Monitor system performance, identify issues, and proactively resolve them.
Conduct root cause analysis for incidents and implement preventive measures.
Automation and Efficiency:
Develop and maintain automation scripts and tools to streamline operations and reduce manual interventions.
Implement infrastructure as code (IaC) practices using tools such as Terraform or Ansible.
Collaboration and Support:
Work closely with development and operations teams to enhance system reliability and performance.
Provide technical support and guidance to other team members on best practices and troubleshooting techniques.
Participate in on-call rotations to ensure 24/7 support for critical systems.
Monitoring and Incident Management:
Set up and maintain monitoring and alerting systems to detect and respond to incidents promptly.
Coordinate incident response, ensuring timely resolution and minimal impact on users.
Document incident reports and contribute to post-mortem analysis to drive continuous improvement.
Capacity Planning and Optimization:
Perform capacity planning to ensure systems can handle peak loads and future growth.
Optimize resource utilization and performance to reduce costs and improve efficiency.
Qualifications:
Education:
Bachelor's degree in Computer Science, Information Technology, or a related field.
Experience:
3-8 years of experience in site reliability engineering, DevOps, or a related role.
Proven experience managing large-scale, high-availability systems.
Skills:
Proficiency in scripting languages such as Python, Bash, or similar.
Strong knowledge of Linux/Unix systems and networking.
Experience with cloud platforms such as AWS, Azure, or Google Cloud.
Familiarity with containerization technologies like Docker and orchestration tools like Kubernetes.
Experience with CI/CD pipelines and tools like Jenkins, GitLab CI, or similar.
Strong problem-solving skills and attention to detail.
Excellent communication and collaboration skills.
Preferred Qualifications:
Experience with configuration management tools like Ansible, Puppet, or Chef.
Knowledge of database systems and caching technologies.
Familiarity with observability tools like Prometheus, Grafana, ELK stack, or similar.
Understanding of security best practices and compliance requirements.