Centific - Director - Software Release Management (10-13 yrs)
posted 13hr ago
Company Description:
Centific is a front-runner in AI-driven innovation, offering a unique blend of services and product-based solutions in digital transformation, software development, and data engineering.
Our cutting-edge technologies and tailored solutions help businesses enhance efficiency, elevate customer experiences, and drive sustainable growth.
With a culture built on collaboration, diversity, and innovation, Centific delivers impactful results through world-class services and proprietary products.
Join us to shape the future of technology and business transformation.
Role Description:
This is a full-time role for a Director SRE at Centific.
The Director will oversee site reliability engineering, team leadership, software release engineering, system administration, and infrastructure tasks.
The role is located in Chennai.
Key Responsibilities:
Strategic Leadership & Vision:
- Lead and manage the Software Release Management function for all Data and AI products.
- Establish a centralized release management framework for AI and data products that scales with the growing product portfolio.
- Form and lead a high-performing Site Reliability Engineering (SRE) team to ensure the operational stability and performance of all AI and data-driven applications post-release.
- Collaborate with Product, Engineering and Operations teams to align release and SRE strategies with business objectives.
Release Planning & Coordination:
- Oversee the full lifecycle of software and AI model releases, from planning and coordination to post-release evaluation.
- Develop and maintain a detailed release calendar that aligns with the timelines and priorities of various product teams.
- Coordinate release activities with multiple cross-functional teams, ensuring transparent communication of dependencies, risks, and milestones.
- Ensure that all releases are integrated seamlessly into production, minimizing downtime and disruptions to end users.
Site Reliability Engineering (SRE) Team Formation:
- Hire, build, and lead the SRE team responsible for maintaining the reliability, scalability, and performance of all Data and AI products in production.
- Define the roles and responsibilities of the SRE team, ensuring clear alignment with the goals of product engineering and release management.
- Develop and implement SRE best practices, including incident response, root cause analysis, and proactive performance monitoring.
- Establish SLAs, SLOs, and SLIs (Service Level Agreements/Objectives/Indicators) to track and measure the reliability and performance of all services post-release.
- Collaborate with DevOps to ensure that automated CI/CD pipelines integrate seamlessly with SRE processes and monitoring systems.
Process Optimization & Automation:
- Lead the automation of software release processes, with an emphasis on CI/CD pipelines for AI models, data pipelines, and cloud-based AI products.
- Develop infrastructure-as-code practices to improve the scalability and reliability of AI and data systems across production environments.
- Introduce tools for version control, model governance, and monitoring for MLOps and AI model management in production.
- Continuously improve operational procedures to reduce the number of incidents and optimize recovery time.
Risk & Quality Management:
- Implement comprehensive quality assurance and validation processes to ensure that all AI models, data products, and software releases meet security, performance, and compliance requirements.
- Proactively identify and mitigate risks related to releases, AI model performance, and operational stability in production.
- Conduct post-release reviews and retrospectives to continuously improve both the release process and the reliability of products.
Collaboration & Stakeholder Management:
- Serve as the central point of contact for release management and SRE-related matters, ensuring consistent communication between engineering, product teams, and key stakeholders.
- Facilitate cross-functional collaboration to ensure that releases and operational reliability goals are met efficiently and effectively.
- Provide regular updates on release progress, system reliability, and any potential risks to executives and product leadership.
Innovation & Continuous Improvement:
- Stay up to date with the latest trends in SRE, DevOps, AI/ML, and cloud operations, incorporating new tools and practices to improve the overall reliability and release processes.
- Drive the adoption of cutting-edge tools in MLOps, AI model deployment, and automated incident resolution to continuously optimize operations and model lifecycle management.
- Foster a culture of continuous improvement by encouraging feedback loops and metrics-driven decision-making across both the release management and SRE teams.
Qualifications:
- Bachelor's or Master's degree in Computer Science, Data Engineering, AI/ML, or a related field.
- 10+ years of experience in software release management, including at least 3 years in SRE or DevOps environments, preferably with AI or data-driven applications.
- Proven experience building and managing both release management and SRE teams in complex, multi-product environments.
- Strong knowledge of AI/ML operations (MLOps), data pipeline management, and cloud-based AI product deployments.
- Expertise in release management tools (Jenkins, GitLab, Git, Jira) and SRE tools such as Prometheus, Grafana, Datadog, or similar monitoring systems.
- Experience with cloud platforms (AWS, GCP, Azure), containerization and orchestration (Docker, Kubernetes), and infrastructure automation tools (Terraform, Ansible).
- Excellent problem-solving, organizational, and leadership skills, with a strong track record of driving continuous improvement in both release and operational reliability processes.
Preferred Qualifications:
- Experience deploying and maintaining large-scale AI/ML models in production environments, including monitoring, retraining, and operationalization.
- Familiarity with ITIL, MLOps, or DevOps frameworks and best practices.
- Knowledge of cloud-based services and tools specifically designed for AI/ML (e.g., AWS SageMaker, TensorFlow, PyTorch).
- Demonstrated ability to manage incident response and root cause analysis in complex software ecosystems.
Functional Areas: Other