About the Job:
The Data Development Insights & Strategy (DDIS) team is seeking a Senior AI Engineer to design, scale, and maintain our AI model lifecycle framework within Red Hat's OpenShift AI and RHEL AI infrastructures. As a Senior AI Engineer, you will contribute to managing and optimizing large-scale AI models, collaborating with cross-functional teams to ensure high availability, continuous monitoring, and efficient integration of new model updates, while driving innovation through emerging AI technologies.
In this role, you will leverage your expertise in AI, MLOps/LLMOps, cloud computing, and distributed systems to enhance model performance, scalability, and operational efficiency. You'll work in close collaboration with the Products & Global Engineering (P&GE) and IT AI Infra teams, ensuring seamless model deployment and maintenance in a secure and high-performance environment. This is an exciting opportunity to drive AI model advancements and contribute to the operational success of mission-critical applications.
What you will do:
- Develop and maintain the lifecycle framework for AI models within Red Hat's OpenShift and RHEL AI infrastructure, ensuring security, scalability, and efficiency throughout the process.
- Design, implement, and optimize CI/CD pipelines and automation for deploying AI models at scale using tools like Git, Jenkins, and Terraform, ensuring zero disruption during updates and integration.
- Continuously monitor and improve model performance using tools such as OpenLLMetry, Splunk, and Catchpoint, while responding to performance degradation and model-related issues.
- Work closely with cross-functional teams, including the Products & Global Engineering (P&GE) and IT AI Infra teams, to seamlessly integrate new models or model updates into production systems with minimal downtime and disruption.
- Enable a structured process for handling feature requests (RFEs), prioritization, and resolution, ensuring transparent communication and timely resolution of model issues.
- Assist in fine-tuning and enhancing large-scale models, including foundation models like Mistral and Llama, while ensuring computational resources are optimally allocated (GPU management, cost management strategies).
- Drive performance improvements, model updates, and releases on a quarterly basis, ensuring that all RFEs are processed and resolved within 30 days.
- Collaborate with stakeholders to align AI model updates with evolving business needs, data changes, and emerging technologies.
- Contribute to mentoring junior engineers, fostering a collaborative and innovative environment.
What you will bring:
- A bachelor's or master's degree in Computer Science, Data Science, Machine Learning, or a related technical field is required.
- Hands-on experience that demonstrates your ability and interest in AI engineering and MLOps will be considered in lieu of formal degree requirements.
- Experience programming in Python, with a strong understanding of machine learning frameworks and tools.
- Experience working with cloud platforms such as AWS, GCP, or Azure, and familiarity with deploying and maintaining AI models at scale in these environments.
- As a Senior AI Engineer, you will be most successful if you have experience working with large-scale distributed systems and infrastructure, especially in production environments where AI and LLM models are deployed and maintained. You should be comfortable troubleshooting, optimizing, and automating workflows related to AI model deployment, monitoring, and lifecycle management. We value a strong ability to debug and optimize model performance and automate manual tasks wherever possible.
- Additionally, you should be well-versed in managing AI model infrastructure using containerization technologies like Kubernetes and OpenShift, and have hands-on experience with performance monitoring tools (e.g., OpenLLMetry, Splunk, Catchpoint). We also expect you to have a solid understanding of GPU-based computing and resource optimization, with a background in high-performance computing (e.g., CUDA, vLLM, MIG, TGI, TEI).
- Experience working in Agile development environments.
- The ability to work collaboratively within cross-functional teams to solve complex problems and drive AI model updates will be key to your success in this role.
Desired skills:
- 5+ years of experience in AI or MLOps, with a focus on deploying, maintaining, and optimizing large-scale AI models in production.
- Expertise in deploying and managing models in cloud environments (AWS, GCP, Azure) and containerized platforms like OpenShift or Kubernetes.
- Familiarity with large-scale distributed systems and experience managing their performance and scalability.
- Experience with performance monitoring and analysis tools such as OpenLLMetry, Prometheus, or Splunk.
- Deep understanding of GPU-based deployment strategies and computational cost management.
- Strong experience in managing model lifecycle processes, from training to deployment, monitoring, and updates.
- Ability to mentor junior engineers and promote knowledge sharing across teams.
- Excellent communication skills, both verbal and written, with the ability to engage with technical and non-technical stakeholders.
- A passion for innovation and continuous learning in the rapidly evolving field of AI and machine learning.
Employment Type: Full Time, Permanent