Site Reliability Engineer (7-10 yrs)
Whitefield Careers
posted 29d ago
Job Overview :
We are seeking a highly motivated and experienced Site Reliability Engineer (SRE) to join our team. As an SRE, you will be responsible for ensuring the reliability, availability, and performance of our systems and infrastructure. You will work closely with development, operations, and other engineering teams to build and maintain highly resilient and scalable systems. This role requires a deep understanding of systems engineering principles, a passion for automation, and a data-driven approach to problem-solving.
Responsibilities :
- Design, implement, and maintain systems and infrastructure that meet our reliability and availability targets. Proactively identify and mitigate potential risks to system stability.
- Develop and implement comprehensive monitoring and alerting systems to track system performance and identify bottlenecks.
- Analyze performance data and implement optimizations to improve system efficiency.
- Participate in incident response processes, including troubleshooting, root cause analysis, and post-incident reviews.
- Develop and implement strategies to prevent future incidents.
- Automate repetitive tasks and processes to improve efficiency and reduce manual effort.
- Develop and maintain automation tools and scripts.
- Forecast system capacity needs and work with development and operations teams to ensure adequate resources are available.
Required Skills :
- Strong understanding of systems engineering principles and SRE best practices.
- Experience with monitoring and alerting systems (e.g., Prometheus, Grafana, Datadog, New Relic).
- Proficiency in at least one scripting language (e.g., Python, Bash, Go).
- Experience with configuration management and infrastructure-as-code tools (e.g., Ansible, Chef, Puppet, Terraform).
- Knowledge of cloud computing platforms (e.g., AWS, Azure, GCP).
- Experience with containerization and orchestration technologies (e.g., Docker, Kubernetes).
- Strong understanding of networking concepts and protocols.
- Excellent troubleshooting and problem-solving skills.
- Strong communication and collaboration skills.
- Experience with incident management processes.
Preferred Skills :
- Experience with distributed systems and microservices architecture.
- Knowledge of database systems (e.g., SQL, NoSQL).
- Experience with CI/CD pipelines and tools.
- Familiarity with service mesh technologies (e.g., Istio, Linkerd).
- Experience with performance testing and tuning.
- SRE certifications.
Education : Bachelor's degree in Computer Science, Information Technology, or a related field (preferred).
Functional Areas: Software/Testing/Networking