AWS HPC Subject Matter Expert - Cluster Management (10-17 yrs)
People Decode Solutions
posted 1mon ago
Fixed timing
Role : AWS High Performance Computing (HPC) Subject Matter Expert
Position Overview :
We are seeking an experienced Subject Matter Expert in AWS High Performance Computing to architect, implement, and optimize HPC solutions on AWS. This role will provide technical leadership in designing and managing large-scale computational workloads, parallel computing environments, and HPC clusters in the AWS cloud.
Key Responsibilities :
- Design and implement scalable HPC architectures on AWS for complex computational workloads
- Provide technical leadership in HPC solution architecture, including cluster management, job scheduling, and workflow optimization
- Guide teams in implementing best practices for AWS HPC services including AWS ParallelCluster, AWS Batch, FSx for Lustre, and EFA
- Optimize cost and performance of HPC workloads on AWS
- Develop automation solutions for HPC infrastructure deployment and management
- Lead technical discussions with stakeholders to understand computational requirements and propose solutions
Required Qualifications :
- Bachelor's degree in Computer Science, Engineering, or related technical field
- 7+ years of experience in HPC systems administration or architecture
- 5+ years of hands-on experience with AWS services
- AWS Professional-level certification (Solutions Architect or DevOps Engineer)
- Strong experience with Linux/Unix systems administration
- Expertise in HPC schedulers (Slurm, AWS Batch, GridEngine)
- Proficiency in scripting languages (Python, Bash, etc.)
Preferred Qualifications :
- Master's degree or Ph.D. in related field
- Experience with container technologies (Docker, Singularity)
- Knowledge of ML/AI frameworks and their HPC requirements
- Expertise in parallel programming (MPI, OpenMP)
- Experience with CFD, FEA, or other scientific computing applications
- Background in research computing or scientific domains
Technical Skills :
AWS Services Expertise :
- AWS ParallelCluster
- AWS Batch
- Amazon FSx for Lustre
- Elastic Fabric Adapter (EFA)
- EC2 instance types (especially HPC-optimized)
- S3 and storage solutions
- CloudFormation/CDK
- AWS Identity and Access Management (IAM)
HPC & Computing Skills :
- Cluster management and orchestration
- Job scheduling and workload management
- Parallel file systems
- Performance optimization
- Network optimization
- Queue management
- Resource monitoring and metrics
- Infrastructure as Code (IaC)
- HPC architecture documentation
- Performance optimization reports
- Best practices guides
- Technical training materials
- Implementation playbooks
- Cost optimization strategies
Functional Areas: Other