Diamondpick
Senior Site Reliability Engineer (SRE)
posted 7d ago
Flexible timing
JOB TITLE:
Site Reliability Engineer (SRE)
Level E2
INTRODUCTION:
At NBCUniversal, we believe in the talent of our people. It's our passion and commitment to excellence that drives NBCU's vast portfolio of brands to succeed. From broadcast and cable networks, news and sports platforms, to film, world-renowned theme parks and a diverse suite of digital properties, we take pride in all that we do and all that we represent. It's what makes us uniquely NBCU. Here you can create the extraordinary. Join us.
ABOUT THE ROLE:
The Legal and Privacy Engineering organization is looking for a Site Reliability Engineer (SRE) who is a well-rounded IT professional with strong software troubleshooting skills, software development experience, and strong systems administration skills.
SRE team members are responsible for ensuring the stability, scalability, and performance of our legal and privacy systems by blending software engineering and operations expertise. They proactively create monitoring and alerting systems, monitor those systems to find gaps, address gaps before they impact users, respond to issues, and continually improve our systems. They automate manual processes to increase reliability and reduce operational costs. They track down defects and devise innovative solutions to improve reliability and availability.
In this role, you will handle site reliability engineering responsibilities across all systems within the legal and privacy space and work as part of a larger SRE team to provide continuous coverage.
Responsibilities include the following:
Monitor system performance and reliability to proactively identify and address potential issues before they impact users.
Develop, maintain, and optimize alerting and monitoring systems to ensure high availability and system performance.
Communicate effectively with stakeholders about system status, downtime, and issues.
Measure and report on system availability and performance against defined SLAs.
Participate in on-call rotations, ensuring timely and effective incident response and resolution.
Conduct thorough root cause analysis of incidents and outages and implement preventive measures to avoid recurrence.
Automate routine tasks and processes to minimize manual intervention and optimize operational efficiency.
Collaborate closely with development teams to ensure new features are designed for reliability, scalability, and effective monitoring.
Plan, test, coordinate, and implement new systems, upgrades, and modifications.
Design, develop, and maintain scalable infrastructure systems to support high-traffic applications.
Collaborate with vendors and cross-functional teams to ensure seamless integration and alignment of efforts.
Create and maintain documentation for systems, processes, and procedures to ensure knowledge is shared and accessible.
Assist with documenting system designs, processes, and troubleshooting procedures to facilitate knowledge sharing within the team.
Ensure automated CI/CD deployments run successfully, providing troubleshooting and fallback support as needed to prevent service disruptions.
Manage system capacity planning and scaling strategies to effectively handle growth and traffic fluctuations.
Establish and enforce best practices for security, compliance, and configuration management.
Continuously enhance system reliability by evaluating and integrating new tools, technologies, and best practices.
REQUIREMENTS:
1-3 years of experience as a site reliability engineer, DevOps engineer, or in a similar role, with a strong focus on systems administration and experience in software engineering.
Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field.
Advanced degrees or relevant certifications are a plus.
Proficiency in cloud platforms such as AWS, Azure, or Google Cloud, with experience in managing cloud-based infrastructure.
Strong scripting and automation skills using languages like Python, Bash, or PowerShell.
Experience with configuration management tools (e.g., Ansible, Chef, Puppet) and infrastructure-as-code (IaC) tools (e.g., Terraform).
In-depth knowledge of CI/CD pipelines, including tools such as GitLab.
Proficiency in AWS CloudFormation is required; proficiency in AWS CDK is a plus.
Proficiency in monitoring and alerting tools and the ability to design and optimize alerting systems.
Solid understanding of networking concepts, security best practices, and compliance requirements.
Strong problem-solving skills, with a demonstrated ability to perform root cause analysis and implement effective solutions to prevent future incidents.
Excellent communication and collaboration skills, with the ability to work effectively with cross-functional teams, including development and product teams.
Familiarity with incident management frameworks and experience in participating in on-call rotations.
Ability to manage multiple priorities in a fast-paced environment, with a strong focus on detail and quality.
Employment Type: Full Time, Permanent