Role & responsibilities:
- System Monitoring and Incident Management: Monitor the health and performance of critical systems, applications, and services. Respond to incidents, troubleshoot issues, and ensure timely resolution to minimize downtime and service disruptions.
- Automation and Scripting: Develop and maintain automation scripts and tools to streamline operational tasks, deployment processes, and infrastructure management.
- Infrastructure Management: Manage and scale the underlying infrastructure, including servers, cloud services, and network components. Implement best practices for configuration management, monitoring, and disaster recovery.
- Release Management: Collaborate with development teams to ensure smooth and reliable software releases. Participate in the design and implementation of deployment strategies.
- Performance Optimization: Identify performance bottlenecks and optimize the system to improve reliability and response times.
- Capacity Planning: Analyze system capacity and plan for future growth to meet increasing demands.
- Security and Compliance: Implement security best practices and ensure compliance with relevant industry standards and regulations.
- Collaboration and Documentation: Work closely with cross-functional teams, including developers, product managers, and operations, to ensure efficient communication and knowledge sharing. Document processes, procedures, and troubleshooting guides.
- On-Call Support: Participate in an on-call rotation to handle urgent issues and incidents outside regular business hours.
Qualifications:
- Experience with Cloud Technologies: Proficiency in working with one or more cloud platforms like AWS, Google Cloud Platform, or Microsoft Azure.
- Programming and Scripting Skills: Strong knowledge of at least one programming language (e.g., Python, Java) and experience with shell scripting.
- System Administration: Hands-on experience with Linux/Unix systems; knowledge of system administration and networking concepts is good to have.
- Monitoring and Logging: Experience with monitoring tools such as Prometheus, Grafana, and Nagios, and with log management solutions like the ELK stack.
- Infrastructure as Code (IaC): Knowledge of Infrastructure as Code tools like Terraform or CloudFormation.
- Automation and Configuration Management: Experience with tools like Ansible, Chef, or Puppet for automating infrastructure management.
- Version Control: Familiarity with version control systems like Git.
- Problem-Solving Skills: Ability to analyze and troubleshoot complex technical issues, and to work with other teams to help streamline processes.
- Communication Skills: Strong verbal and written communication skills to collaborate effectively with team members and stakeholders.
- KPI/Metrics: Understanding of key SRE metrics such as availability, SLA/SLO, MTTA, and MTTR.
- Education: Any hands-on individual with a BCA/MCA or B.Tech background.
Employment Type: Full Time, Permanent