Site Reliability Engineer (7-20 yrs)
Core Technologies & Solutions
posted 3d ago
Flexible timing
Job Description:
- Engage with our product teams to understand requirements, design, and implement resilient and scalable infrastructure solutions
- Operate, monitor, and triage all aspects of our production and non-production environments
- Collaborate with other engineers on code, infrastructure, design reviews, and process enhancements.
- Evaluate and integrate new technologies to improve system reliability, security, and performance
- Develop and implement automation to provision, configure, deploy, and monitor Apple services
- Participate in an on-call rotation providing hands-on technical expertise during service-impacting events
- Design, build, and maintain highly available and scalable infrastructure
- Implement and improve monitoring, alerting, and incident response systems
- Automate operations tasks and develop efficient workflows
- Conduct system performance analysis and optimization
- Collaborate with development teams to ensure smooth deployment and release processes
- Implement and maintain security best practices and compliance standards
- Troubleshoot and resolve system and application issues
- Participate in capacity planning and scaling efforts
- Stay up-to-date with the latest trends, technologies, and advancements in SRE practices
- Contribute to capacity planning, scale testing, and disaster recovery exercises.
- Approach operational problems with a software engineering mindset
- BS degree in computer science or equivalent field with 5+ years of experience
- 5+ years in an Infrastructure Ops, Site Reliability Engineering, or DevOps-focused role.
- Knowledge of Linux operating system principles, networking fundamentals, and systems management.
- Demonstrable fluency in at least one of the following languages: Java, Python, or Go
- Experience managing and scaling distributed systems in a public, private, or hybrid cloud environment
- Develop and implement automation tools and apply best practices for system reliability.
- Take responsibility for the availability and scalability of our services, and manage disaster recovery and other operational tasks.
- Collaborate with the development team to improve logging, metrics, and traces in the application codebase for better observability.
- Collaborate with data science teams and other business units to design, build and maintain the infrastructure that runs machine learning and generative AI workloads.
- Influence architectural decisions with focus on security, scalability and performance.
- Find and fix problems in production, and work to prevent them from happening again
Preferred Qualifications:
- Familiarity with microservices architecture and container orchestration with Kubernetes.
- Awareness of key security principles, including encryption and keys (types and exchange protocols).
- Understanding of SRE principles, including monitoring, alerting, error budgets, fault analysis, and automation.
- Strong sense of ownership, with a desire to communicate and collaborate with other engineers and teams.
- Ability to identify and communicate technical and architectural problems, while working with partners and their teams to iteratively find solutions.
Functional Areas: Software/Testing/Networking