Application Open:
Full-Time
MBZUAI is seeking a highly skilled and innovative HPC Engineer with strong hands-on technical experience to join the Robot Learning Laboratory, a state-of-the-art core facility. This role is ideal for candidates with a strong background in HPC and computer engineering and a passion for GPU cluster management, distributed training optimization, and system performance tuning in support of cutting-edge research at scale.
With an emphasis on practical implementation complemented by research-driven innovation, this position focuses on designing, deploying, and optimizing high-performance computing infrastructure for AI/ML workloads and robotics research.
Key Responsibilities
HPC Cluster Architecture and Management
- Design and deploy GPU cluster architecture for large-scale AI/ML training
- Configure and manage HPC schedulers (SLURM/PBS/Kubernetes) for resource allocation
- Implement fair-share policies, GPU quotas, and priority queues
- Design high-speed network topology (InfiniBand/RoCE) for optimal performance
- Plan capacity and resource utilization for growing compute demands (see the reporting sketch after this list)
- Integrate HPC cluster with data storage systems (Lustre/BeeGFS/Ceph)
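For illustration only, a minimal sketch of the kind of utilization reporting this involves, built on the standard Slurm client tools (sinfo/squeue) called from Python; the script itself, its output format, and any partition names are assumptions rather than prescribed tooling:

```python
#!/usr/bin/env python3
"""Minimal sketch: report per-partition node counts and job states on a Slurm
cluster, as a starting point for capacity and utilization tracking.
Assumes the Slurm client tools (sinfo, squeue) are available on PATH."""
import subprocess
from collections import Counter, defaultdict

def run(cmd):
    # Run a Slurm CLI command and return its stdout as non-empty lines.
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return [line for line in out.stdout.splitlines() if line.strip()]

def partition_nodes():
    # %P = partition name, %D = node count (standard sinfo format codes);
    # sinfo emits one line per partition/state group, so counts are summed.
    nodes = defaultdict(int)
    for line in run(["sinfo", "-h", "-o", "%P %D"]):
        part, count = line.split()
        nodes[part.rstrip("*")] += int(count)
    return nodes

def job_states():
    # %P = partition, %T = job state (RUNNING, PENDING, ...).
    states = defaultdict(Counter)
    for line in run(["squeue", "-h", "-o", "%P %T"]):
        part, state = line.split()
        states[part][state] += 1
    return states

if __name__ == "__main__":
    states = job_states()
    for part, count in sorted(partition_nodes().items()):
        s = states.get(part, Counter())
        print(f"{part:<15} nodes={count:<5} running={s['RUNNING']:<5} pending={s['PENDING']}")
```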
Performance Tuning and Optimization
- Benchmark and optimize GPU utilization for deep learning frameworks (PyTorch/TensorFlow)
- Tune distributed training performance (NCCL, Horovod, DeepSpeed, Megatron)
- Optimize data loading pipelines and checkpoint/restore operations
- Profile and analyze system bottlenecks (CPU, GPU, memory, network, I/O)
- Implement performance monitoring and alerting for compute resources
- Optimize MPI and collective communication patterns (a microbenchmark sketch follows this list)
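For illustration only, a minimal collective-communication microbenchmark over torch.distributed with the NCCL backend; the message sizes, iteration counts, and the assumption of a torchrun-style launch (RANK/WORLD_SIZE/LOCAL_RANK set in the environment) are placeholders, not a prescribed methodology:

```python
"""Minimal sketch: time all_reduce over the NCCL backend to spot interconnect
bottlenecks. Assumes launch via torchrun (RANK/WORLD_SIZE/LOCAL_RANK set)."""
import os
import time
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    world = dist.get_world_size()

    for mb in (16, 64, 256, 1024):          # message sizes in MiB (illustrative)
        numel = mb * 1024 * 1024 // 4       # fp32 elements
        x = torch.ones(numel, device="cuda")
        # Warm-up so timings exclude NCCL communicator setup cost.
        for _ in range(5):
            dist.all_reduce(x)
        torch.cuda.synchronize()
        iters = 20
        start = time.perf_counter()
        for _ in range(iters):
            dist.all_reduce(x)
        torch.cuda.synchronize()
        elapsed = (time.perf_counter() - start) / iters
        # Ring all-reduce bus bandwidth estimate: 2 * (N-1)/N * bytes / time.
        bus_gb = 2 * (world - 1) / world * (numel * 4) / elapsed / 1e9
        if dist.get_rank() == 0:
            print(f"{mb:>5} MiB  {elapsed * 1e3:8.2f} ms  ~{bus_gb:6.1f} GB/s bus BW")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

A script of this kind would typically be launched under the scheduler (for example with torchrun across several nodes) and its per-size bandwidth compared against what the InfiniBand/RoCE fabric should deliver.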
Distributed Training Support
- Support researchers with distributed training job configuration
- Troubleshoot multi-node training failures and performance issues
- Implement best practices for gradient synchronization and model parallelism
- Optimize batch sizes, learning rates, and scaling strategies
- Develop tools and scripts for job submission and monitoring (see the submission-helper sketch after this list)
- Create documentation and training materials for HPC users
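For illustration only, a minimal sketch of a submission helper that renders and submits an sbatch script for a multi-node torchrun launch; the defaults (node count, GPUs per node, wall time) and the omission of partition/account settings are placeholder assumptions to adapt to the local cluster:

```python
#!/usr/bin/env python3
"""Minimal sketch of a job-submission helper: render an sbatch script for a
multi-node torchrun launch and submit it. Resource defaults are placeholders."""
import argparse
import subprocess
import tempfile

TEMPLATE = """#!/bin/bash
#SBATCH --job-name={name}
#SBATCH --nodes={nodes}
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:{gpus}
#SBATCH --time={time}
#SBATCH --output={name}-%j.out

# First node in the allocation acts as the rendezvous host.
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun torchrun \\
    --nnodes={nodes} \\
    --nproc_per_node={gpus} \\
    --rdzv_backend=c10d \\
    --rdzv_endpoint="$MASTER_ADDR:29500" \\
    {script}
"""

def submit(args):
    body = TEMPLATE.format(name=args.name, nodes=args.nodes, gpus=args.gpus,
                           time=args.time, script=args.script)
    with tempfile.NamedTemporaryFile("w", suffix=".sbatch", delete=False) as f:
        f.write(body)
        path = f.name
    # sbatch prints e.g. "Submitted batch job <id>" on success.
    result = subprocess.run(["sbatch", path], capture_output=True, text=True, check=True)
    print(result.stdout.strip())

if __name__ == "__main__":
    p = argparse.ArgumentParser()
    p.add_argument("script", help="training script to launch via torchrun")
    p.add_argument("--name", default="train")
    p.add_argument("--nodes", type=int, default=2)
    p.add_argument("--gpus", type=int, default=8)
    p.add_argument("--time", default="04:00:00")
    submit(p.parse_args())
```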
System Administration and Maintenance
- Install and configure GPU drivers, CUDA, cuDNN, and ML frameworks
- Manage system software stack (OS, libraries, compilers, tools)
- Perform system health checks and preventive maintenance (see the health-check sketch after this list)
- Handle hardware failures and coordinate with vendors
- Implement backup and disaster recovery procedures
- Maintain system security and access controls
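For illustration only, a minimal node-level health-check sketch built on nvidia-smi query fields; the thresholds and the check selection are placeholder assumptions to tune per node type, not a prescribed monitoring design:

```python
#!/usr/bin/env python3
"""Minimal sketch of a node-level GPU health check using nvidia-smi query
fields. Thresholds are illustrative and should be tuned per hardware."""
import subprocess
import sys

TEMP_LIMIT_C = 85            # illustrative alert threshold
MEM_FRACTION_LIMIT = 0.95    # illustrative alert threshold

def gpu_stats():
    # Each output line: index, temperature (C), GPU util (%), mem used/total (MiB).
    fields = "index,temperature.gpu,utilization.gpu,memory.used,memory.total"
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={fields}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True)
    for line in out.stdout.strip().splitlines():
        idx, temp, util, used, total = [v.strip() for v in line.split(",")]
        yield int(idx), int(temp), int(util), int(used), int(total)

def main():
    problems = []
    for idx, temp, util, used, total in gpu_stats():
        if temp >= TEMP_LIMIT_C:
            problems.append(f"GPU {idx}: temperature {temp} C")
        if used / total >= MEM_FRACTION_LIMIT:
            problems.append(f"GPU {idx}: memory {used}/{total} MiB")
    if problems:
        print("\n".join(problems))
        sys.exit(1)   # non-zero exit so cron or a monitoring agent can alert
    print("all GPUs healthy")

if __name__ == "__main__":
    main()
```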
Collaboration and Innovation
- Work with data architects on storage-compute integration
- Collaborate with MLOps engineers on training pipeline optimization
- Assist AIOps engineers with intelligent resource scheduling
- Evaluate emerging HPC technologies (CXL, GPUDirect Storage, SmartNICs)
- Participate in architecture reviews and capacity planning
- Share knowledge through technical presentations and documentation
Academic Qualifications Required
- Master’s degree in Computer Science, Engineering, or a related field.
- A PhD is preferred.
Professional Experience Required
Essential:
- 3+ years of experience in HPC system administration, resource management, or distributed computing.
- Demonstrated hands-on experience with high-performance computing (HPC), including managing GPU-based clusters for AI/ML workloads.
- Strong Linux system administration skills (e.g., RHEL, Ubuntu) and ability to operate in large-scale compute environments.
- Proficiency with HPC workload schedulers such as Slurm (preferred) or PBS, including job configuration, resource management, and troubleshooting.
- Solid understanding of GPU computing, particularly NVIDIA GPUs, CUDA, and cuDNN, and how to optimize AI/ML workloads for them.
- Experience with distributed training using frameworks such as PyTorch or TensorFlow, and related communication libraries (e.g., NCCL, Horovod); a minimal sketch follows this list.
- Scripting and automation proficiency in Python and Bash to support deployment, monitoring, and operational tasks.
- Experience with high-speed networking concepts and technologies (e.g., InfiniBand, RoCE, RDMA).
- Familiarity with parallel file systems such as Lustre, BeeGFS, or GPFS to support large-scale AI workloads.
- Ability to work in a demanding, service-oriented environment with strong organization, communication, and collaboration skills.
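For illustration only, a minimal DistributedDataParallel skeleton of the kind a candidate would be expected to read, debug, and tune; the model, data, and hyperparameters are synthetic placeholders, and a torchrun-style launch (RANK/WORLD_SIZE/LOCAL_RANK set) is assumed:

```python
"""Minimal sketch of a DistributedDataParallel training loop over the NCCL
backend; model, data, and hyperparameters are synthetic placeholders."""
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model and synthetic data; swap in the real workload.
    model = DDP(torch.nn.Linear(1024, 10).cuda(), device_ids=[local_rank])
    data = TensorDataset(torch.randn(4096, 1024), torch.randint(0, 10, (4096,)))
    sampler = DistributedSampler(data)                 # shards data across ranks
    loader = DataLoader(data, batch_size=64, sampler=sampler,
                        num_workers=4, pin_memory=True)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)                       # reshuffle shards per epoch
        for x, y in loader:
            x, y = x.cuda(non_blocking=True), y.cuda(non_blocking=True)
            opt.zero_grad()
            loss_fn(model(x), y).backward()            # DDP syncs gradients here
            opt.step()
        if dist.get_rank() == 0:
            print(f"epoch {epoch} done")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```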
Preferred Skills
- Experience with container orchestration for HPC (e.g., Kubernetes) and/or running AI/ML workloads in containerized environments.
- Experience with cloud-based HPC solutions (AWS ParallelCluster, Azure CycleCloud, or GCP HPC).
- Use of performance and profiling tools (e.g., nvprof, nsys, DCGM, Grafana) for workload optimization and system tuning.
- Exposure to advanced AI training frameworks such as DeepSpeed or Megatron-LM for large-model training.
- Experience with benchmarking frameworks (MLPerf, HPL, HPCG) for system validation and performance characterization.
- Knowledge of emerging technologies relevant to AI/HPC (e.g., GPUDirect Storage, CXL, SmartNICs).
- Contributions to open-source HPC or AI systems projects.
- Relevant industry certifications (e.g., NVIDIA DGX, CKA) or publications in HPC/AI systems venues (SC, ISC, MLSys).