Neysa - Linux Systems Administrator - HPC Environment (8-15 yrs)
Nirdesa Networks
posted 16hr ago
Job title : Linux System Admin HPC
Minimum Experience : 8 to 15 Years
Job level : Mid-Level
Type : Full Time
Job Description :
Day in the life
In this role you'll :
- Manage a cutting-edge High-Performance Computing (HPC) grid providing supercomputing capabilities to power complex workloads.
- Ensure optimal uptime and availability of HPC resources, proactively addressing system issues.
- Collaborate with developers and end-users to maximize system efficiency while maintaining system stability.
- Monitor workload distribution, ensuring fairness and adherence to committed resources.
- Partner with infrastructure experts and leverage their experience to sustain a high-performing environment.
- Guide vendors and OEMs to meet the unique technical and operational demands of the HPC environment.
Must-have skills :
HPC Linux Systems Management :
- Administer and maintain Linux-based HPC clusters, including Red Hat, Rocky, Ubuntu, and Debian distributions.
- Perform system-wide maintenance, upgrades, kernel tuning, and security hardening for optimal performance.
- Manage job scheduling systems like SLURM, including configuration, troubleshooting, and optimization.
System Performance Optimization :
- Analyze and optimize HPC workloads using performance monitoring tools such as perf, sar, strace, and iotop.
- Address complex interdependencies in HPC systems, ensuring balanced performance across CPU, GPU, memory, and network resources.
- Implement and validate benchmarking tools to ensure performance aligns with acceptance criteria.
Networking in HPC Environments :
- Configure, monitor, and optimize high-speed interconnects such as InfiniBand and Ethernet.
- Implement network namespaces, VLANs, LACP, and other advanced networking setups.
- Manage firewall configurations using iptables, nftables, or firewalld for secure and efficient data flow.
Storage and Filesystem Management :
- Oversee large-scale storage systems, including RAID, LVM, and high-performance file systems like Lustre or BeeGFS.
- Monitor I/O performance, resolve storage bottlenecks, and manage backups and disaster recovery.
Linux Kernel and Driver Expertise :
- Customize, compile, and manage Linux kernels tailored for HPC environments.
- Debug and optimize device drivers for GPUs and other HPC-specific hardware.
- Handle boot processes, kernel panics, and low-level system configurations.
Cluster and Container Orchestration :
- Deploy and maintain HPC clusters with tools like Kubernetes and Singularity for containerized workflows.
- Manage virtualized HPC workloads using KVM, VMware, or Proxmox.
- Optimize resource scheduling and container orchestration to achieve maximum scalability.
Automation and Scripting :
- Develop advanced shell scripts (Bash, awk, sed) and manage automation tools like Ansible and Terraform.
- Automate configuration management, system monitoring, and incident resolution workflows.
What separates the best from the rest
Added bonuses if you have experience with :
Preferred Experience :
- Expertise in GPU-based systems, including CUDA profiling, optimization, and troubleshooting.
- Proven ability to debug and optimize C programs interacting with HPC hardware resources using tools like gdb and valgrind.
- Deep understanding of Linux kernel subsystems like scheduling, memory management, PCIe, and swap.
Soft Skills :
- Strong communication and collaboration skills to guide less experienced users and developers.
- Ability to balance user needs with system stability in a high-demand, resource-intensive environment.
- Analytical mindset to foresee and mitigate system-wide impacts of configuration changes.
It would be amazing if you
Work with users :
Not all users are created equal, and many lack (or are not interested in) an understanding of how their actions can affect system performance. Your attitude in dealing with such users (often frustrating, to you) will be critical in maintaining good working order.
What you can expect
An environment where you can do your best work :
- Work on best-in-class HPC infrastructure that challenges the limits of technology.
- Enjoy flexibility in work hours and location while contributing to groundbreaking projects.
- Collaborate with a highly experienced team in a supportive, knowledge-sharing environment.
- Shape the future of computing while achieving personal and professional growth.
Functional Areas: Other
8-15 Yrs
Mumbai