Fixity Technologies
Site Reliability Engineer - Terraform/Ansible (8-10 yrs)
Posted 2 months ago
Flexible timing
Job Description :
A DevOps Site Reliability Engineer (SRE) combines software engineering, systems engineering, and operations to keep a system or infrastructure reliable, scalable, and performant. SREs focus on automating operational tasks, monitoring system health, improving reliability, and ensuring that software runs smoothly in production environments. The role and its key responsibilities break down as follows :
Key Responsibilities :
1. Reliability and Availability :
- Ensuring that services are highly available and meet performance SLAs (Service Level Agreements).
- Designing and implementing systems for monitoring and alerting.
- Taking proactive steps to prevent incidents and reduce downtime.
- Troubleshooting and resolving issues related to infrastructure, applications, and services.
2. Automation and Infrastructure as Code :
- Automating manual tasks such as deployments, scaling, and incident response using tools like Terraform, Ansible, Chef, or Puppet.
- Managing infrastructure using code to create repeatable and consistent environments.
3. Performance Monitoring and Optimization :
- Setting up and maintaining monitoring systems (e.g., Prometheus, Grafana, Datadog) to track application and system health.
- Identifying performance bottlenecks and optimizing systems to improve speed, capacity, and efficiency.
4. Incident Management :
- Handling incidents when they occur, ensuring that systems are restored as quickly as possible.
- Participating in post-mortems to understand the root causes of issues and create solutions that prevent them from recurring.
5. Scalability and Load Balancing :
- Ensuring the system can scale effectively to handle large amounts of traffic or load without performance degradation.
- Implementing load balancing strategies and ensuring infrastructure can handle sudden spikes.
6. Collaboration with Development Teams :
- Working closely with software engineers, developers, and IT operations teams to build, deploy, and maintain applications.
- Integrating development pipelines with CI/CD (Continuous Integration/Continuous Deployment) systems.
- Improving the development process through feedback and collaboration on reliability and performance requirements.
7. Cloud Infrastructure Management :
- Managing cloud environments (e.g., AWS, Google Cloud, Azure) and ensuring they are well architected for high availability and performance.
- Implementing cost-effective and efficient cloud resource management strategies.
8. Security and Compliance :
- Ensuring that security measures are implemented to protect data and prevent security breaches.
- Implementing best practices for securing production environments.
9. Disaster Recovery :
- Planning and implementing disaster recovery strategies to minimize data loss and downtime in case of failures.
- Testing recovery procedures regularly.
Key Skills and Tools :
Programming and Scripting : Proficiency in languages like Python, Go, Ruby, or Bash for automation and scripting tasks.
Cloud Technologies :
- Experience with AWS, Azure, or Google Cloud Platform (GCP).
- Familiarity with containerization tools like Docker and orchestration systems like Kubernetes.
Monitoring and Logging : Experience with tools such as Prometheus, Grafana, ELK Stack (Elasticsearch, Logstash, Kibana), New Relic, Datadog, or Splunk.
CI/CD : Familiarity with CI/CD tools such as Jenkins, CircleCI, GitLab CI, or Travis CI.
Version Control Systems : Knowledge of Git for source code versioning and collaboration.
Infrastructure as Code (IaC) : Proficiency with tools like Terraform, CloudFormation, or Ansible to define and manage infrastructure.
Networking and Security : A solid understanding of networking protocols, firewalls, and security practices.
Incident Management Tools : Familiarity with tools like PagerDuty, Opsgenie, or VictorOps for managing and responding to incidents.
Functional Areas: Software/Testing/Networking