AI Infrastructure Engineer - GPU Environments (5-10 yrs)
InfoService
posted 15d ago
Flexible timing
Job Title : AI Infrastructure Engineer - GPU & Kubernetes Specialist.
Location : Remote.
Type : Full-time.
Job Description :
We are seeking a highly skilled and motivated AI Infrastructure Engineer with expertise in GPU environments, Kubernetes orchestration, and relevant certifications to join our innovative infrastructure team.
The ideal candidate will be responsible for designing, implementing, and maintaining scalable AI/ML infrastructure solutions, enabling the efficient deployment of AI models and workflows in production environments.
This role demands a deep understanding of GPU acceleration, Kubernetes ecosystems, and NVIDIA-based platforms.
Key Responsibilities :
- Design, deploy, and manage high-performance AI infrastructure leveraging GPU-based systems and Kubernetes clusters.
- Optimize resource allocation and scaling strategies for AI/ML model training and inference.
- Develop and maintain containerized applications, leveraging Docker and Kubernetes to ensure reliability and scalability.
- Implement and manage Kubernetes-based infrastructure, focusing on GPU scheduling, device plugins, and resource management.
- Utilize NVIDIA technologies such as CUDA, TensorRT, and NVIDIA GPU Cloud (NGC) resources for AI workloads.
- Collaborate with AI/ML engineers to streamline deployment pipelines and optimize AI frameworks on GPU architectures.
- Ensure high availability and performance tuning of AI/ML services.
- Stay current with advancements in Kubernetes, GPU technologies, and AI infrastructure best practices.
- Contribute to CI/CD processes for AI workloads using Kubernetes-native tooling.
- Monitor and troubleshoot system performance, network issues, and infrastructure-related bottlenecks.
Required Skills and Qualifications :
- Bachelor's or Master's degree in Computer Science, Engineering, or related field.
- Proven experience managing GPU-based infrastructure in cloud or on-premises environments.
- Proficiency in Kubernetes, including deployments, Helm, Operators, and GPU device management.
- Strong knowledge of containerization and orchestration technologies (Docker, Kubernetes).
- Experience with NVIDIA GPU technologies (CUDA, TensorRT, and related libraries).
- Familiarity with AI/ML frameworks (TensorFlow, PyTorch, etc.) and their GPU-accelerated configurations.
- Understanding of distributed systems, high-performance computing (HPC), and scalable architecture design.
- Strong programming and scripting skills (Python, Go, or similar).
Certifications :
- Kubernetes Certified (CKA, CKAD, or CKS).
- NVIDIA Certified (NVIDIA AI Enterprise or related certifications).
Preferred Qualifications :
- Experience with cloud platforms (AWS, GCP, Azure) and Kubernetes-as-a-Service.
- Knowledge of Kubernetes-native observability tools such as Prometheus and Grafana.
- Hands-on experience with NVIDIA's Kubernetes device plugin and GPU Operator.
Functional Areas: Other