Site Reliability Engineer - Terraform/Ansible (4-8 yrs)
Natobotics
posted 1d ago
Responsibilities:
- Develop and maintain infrastructure as code (IaC) using Terraform for provisioning and managing cloud resources.
- Automate configuration management and deployment processes using Ansible playbooks and roles.
- Design and implement reusable Terraform modules and Ansible roles to improve efficiency and consistency.
- Manage and maintain version control of infrastructure code using Git.
- Design and implement comprehensive monitoring and alerting solutions to ensure system health and performance.
- Utilize observability tools (Prometheus, Grafana, ELK stack) to gather and analyze metrics, logs, and traces.
- Define and implement service level objectives (SLOs) and service level indicators (SLIs).
- Develop and maintain dashboards and alerts to proactively identify and resolve issues.
- Design and implement high availability and fault-tolerant infrastructure solutions.
- Implement disaster recovery and business continuity plans.
- Identify and eliminate single points of failure.
- Perform capacity planning and performance tuning.
- Identify and automate repetitive tasks and manual processes to reduce toil.
- Develop and maintain automation scripts and tools to improve operational efficiency.
- Continuously improve infrastructure and operational processes.
- Participate in on-call rotations and respond to incidents and alerts.
- Troubleshoot and resolve complex infrastructure and application issues.
- Conduct post-incident reviews and implement corrective actions.
- Collaborate with development, operations, and other teams to ensure smooth deployments and operations.
- Communicate technical concepts clearly and concisely.
- Document infrastructure designs, configurations, and procedures.
- Implement and maintain security best practices for infrastructure and applications.
- Ensure compliance with relevant industry standards and regulations.
Required Skills & Qualifications:
- Experience: 4+ years in Site Reliability Engineering or a related role.
- Infrastructure as Code (IaC): Strong experience with Terraform for infrastructure provisioning and management.
- Configuration Management: Proficiency in Ansible for configuration management and automation.
- Observability: Experience with observability tools and techniques (Prometheus, Grafana, ELK stack).
- Monitoring and Alerting: Experience in designing and implementing monitoring and alerting systems.
- High Availability: Understanding of high availability and fault-tolerant architectures.
- Scripting: Proficiency in scripting languages (Python, Bash).
- Version Control: Experience with Git.
- Cloud Platforms: Experience with cloud platforms (AWS, Azure, GCP).
- Linux Systems Administration: Strong understanding of Linux systems administration.
- Networking: Basic understanding of networking concepts.
- Problem-Solving: Excellent problem-solving and troubleshooting skills.
- Communication: Strong communication and collaboration skills.
Preferred Qualifications:
- Experience with Kubernetes and container orchestration.
- Experience with CI/CD pipelines.
- Experience with database administration.
- Relevant certifications (AWS Certified DevOps Engineer, Certified Kubernetes Administrator).
- Experience with security tools and practices.
Functional Areas: Software/Testing/Networking