Site Reliability Engineer - Cloud Platform (5-7 yrs)
G-Tech
posted 6d ago
Flexible timing
Job Title : Site Reliability Engineer
We're Hiring!
Responsibilities :
- Design, implement, and maintain highly available and reliable infrastructure and services.
- Define and monitor Service Level Objectives (SLOs) and Service Level Indicators (SLIs).
- Implement and manage incident response and on-call processes.
- Conduct post-incident reviews and implement corrective actions.
- Monitor and analyze system performance metrics.
- Identify and resolve performance bottlenecks and scalability issues.
- Implement performance tuning strategies and optimizations.
- Conduct capacity planning and forecasting.
- Automate infrastructure provisioning, configuration, and deployment using Infrastructure as Code (IaC) tools (Terraform, CloudFormation, Ansible).
- Develop and maintain automation scripts and tools for system administration and monitoring.
- Implement and manage CI/CD pipelines for automated deployments.
- Utilize and enhance monitoring and logging tools (Prometheus, Grafana, ELK stack, Datadog).
- Lead incident response efforts and coordinate with cross-functional teams.
- Develop and maintain incident response plans and procedures.
- Analyze incident data and identify patterns and trends.
- Implement proactive measures to prevent future incidents.
- Monitor resource utilization and forecast future capacity needs.
- Implement auto-scaling and load balancing strategies.
- Ensure efficient resource allocation and utilization.
- Implement and maintain security best practices and policies.
- Conduct security audits and vulnerability assessments.
- Ensure compliance with industry standards and regulations.
- Collaborate with development, operations, and product teams to ensure smooth service delivery.
- Communicate effectively with team members and stakeholders.
- Participate in design and code reviews.
- Provide technical guidance and mentorship to junior team members.
- Create and maintain detailed documentation of infrastructure, processes, and procedures.
- Share knowledge and best practices with team members.
- Conduct training sessions and workshops.
- Contribute to the development of internal tools and libraries.
- Stay up to date with the latest SRE practices and technologies.
- Research and evaluate new tools and methodologies.
- Identify and implement process improvements.
- Participate in industry events and conferences.
Technical Skills & Qualifications :
- 5+ years of experience as a Site Reliability Engineer or in a similar role.
- Strong understanding of distributed systems and cloud technologies (AWS, Azure, GCP).
- Proficiency in Infrastructure as Code (IaC) tools (Terraform, CloudFormation, Ansible).
- Experience with containerization and orchestration technologies (Docker, Kubernetes).
- Proficiency in scripting languages (Python, Bash, Go).
- Experience with monitoring and logging tools (Prometheus, Grafana, ELK stack, Datadog).
- Experience with CI/CD pipelines and tools (Jenkins, GitLab CI, Azure DevOps).
- Strong understanding of networking concepts and protocols.
- Excellent problem-solving and debugging skills.
- Strong communication and interpersonal skills.
- Ability to work independently and as part of a team.
- Bachelor's degree in Computer Science, Software Engineering, or a related field.
Preferred Qualifications :
- Experience with service mesh technologies (Istio, Linkerd).
- Experience with serverless computing (AWS Lambda, Azure Functions, Google Cloud Functions).
- Experience with database administration and performance tuning.
- Experience with security tools and practices.
- Experience with incident management tools (PagerDuty, Opsgenie).
- Experience with configuration management tools (Chef, Puppet).
Functional Areas: Software/Testing/Networking