HPC Engineer – Core Facilities


Vacancy Overview

Full-Time

MBZUAI is seeking a highly skilled and innovative HPC Engineer with strong hands-on technical experience to join the Robot Learning Laboratory, a state-of-the-art core facility. This role is ideal for candidates with a strong background in HPC and computer engineering and a passion for GPU cluster management, distributed training optimization, and system performance tuning to support cutting-edge research at scale.

With an emphasis on practical implementation complemented by research-driven innovation, this position focuses on designing, deploying, and optimizing high-performance computing infrastructure for AI/ML workloads and robotics research.

Key Responsibilities

HPC Cluster Architecture and Management

  • Design and deploy GPU cluster architecture for large-scale AI/ML training
  • Configure and manage HPC schedulers (SLURM/PBS/Kubernetes) for resource allocation
  • Implement fair-share policies, GPU quotas, and priority queues (see the sketch after this list)
  • Design high-speed network topology (InfiniBand/RoCE) for optimal performance
  • Plan capacity and resource utilization for growing compute demands
  • Integrate HPC cluster with data storage systems (Lustre/BeeGFS/Ceph)
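
As an illustration of the fair-share and quota item above, here is a minimal sketch of how per-user GPU quotas and priority tiers might be enforced as Slurm QOS records. The tier names, priorities, and limits are hypothetical, and the commands assume Slurm accounting is enabled with GPU TRES tracking (AccountingStorageTRES including gres/gpu).

```python
import subprocess

# Hypothetical QOS tiers: the names, priorities, and per-user GPU caps
# are illustrative values, not a recommendation for any real cluster.
QOS_TIERS = {
    "interactive": {"priority": 1000, "max_gpus_per_user": 2},
    "normal":      {"priority": 100,  "max_gpus_per_user": 8},
    "batch":       {"priority": 10,   "max_gpus_per_user": 32},
}

def run(cmd):
    """Run a command, echoing it first so changes stay auditable."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def create_qos_tiers():
    for name, cfg in QOS_TIERS.items():
        # Create the QOS record (idempotence handling omitted), then
        # attach a priority and a per-user GPU cap expressed as a TRES
        # limit; -i applies changes without an interactive confirmation.
        run(["sacctmgr", "-i", "add", "qos", name])
        run(["sacctmgr", "-i", "modify", "qos", name, "set",
             f"priority={cfg['priority']}",
             f"MaxTRESPerUser=gres/gpu={cfg['max_gpus_per_user']}"])

if __name__ == "__main__":
    create_qos_tiers()
```

Fair-share weighting itself lives elsewhere (PriorityWeightFairshare in slurm.conf and per-account shares set through sacctmgr); this sketch covers only the quota and priority side.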

Performance Tuning and Optimization

  • Benchmark and optimize GPU utilization for deep learning frameworks (PyTorch/TensorFlow)
  • Tune distributed training performance (NCCL, Horovod, DeepSpeed, Megatron); a benchmark sketch follows this list
  • Optimize data loading pipelines and checkpoint/restore operations
  • Profile and analyze system bottlenecks (CPU, GPU, memory, network, I/O)
  • Implement performance monitoring and alerting for compute resources
  • Optimize MPI and collective communication patterns
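
The collective-communication items above are typically validated with a micro-benchmark before deeper tuning. Below is a minimal sketch in the spirit of nccl-tests, using torch.distributed over NCCL; the buffer size, iteration counts, and launch assumptions (one process per GPU, LOCAL_RANK set by torchrun) are all illustrative.

```python
import os
import time
import torch
import torch.distributed as dist

def benchmark_allreduce(size_mb=256, iters=20, warmup=5):
    """Time all-reduce over NCCL and report effective bus bandwidth."""
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world = dist.get_world_size()
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    n = size_mb * 1024 * 1024 // 4          # float32 elements
    x = torch.randn(n, device="cuda")

    for _ in range(warmup):                  # warm up NCCL communicators
        dist.all_reduce(x)
    torch.cuda.synchronize()

    t0 = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - t0) / iters

    # A ring all-reduce moves 2*(world-1)/world of the buffer per GPU;
    # "bus bandwidth" normalizes for that, as nccl-tests does.
    algo_bw = size_mb / 1024 / elapsed       # GiB/s
    bus_bw = algo_bw * 2 * (world - 1) / world
    if rank == 0:
        print(f"{size_mb} MiB all-reduce: {elapsed*1e3:.2f} ms, "
              f"bus bandwidth ~{bus_bw:.1f} GiB/s")
    dist.destroy_process_group()

if __name__ == "__main__":
    benchmark_allreduce()
```

Launched with, for example, `torchrun --nproc_per_node=8 allreduce_bench.py` (file name hypothetical), comparing the reported bus bandwidth against the fabric's expected rate quickly separates NCCL or topology misconfiguration from application-level slowness.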

Distributed Training Support

  • Support researchers with distributed training job configuration
  • Troubleshoot multi-node training failures and performance issues
  • Implement best practices for gradient synchronization and model parallelism (see the sketch after this list)
  • Optimize batch sizes, learning rates, and scaling strategies
  • Develop tools and scripts for job submission and monitoring
  • Create documentation and training materials for HPC users
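
To make the gradient-synchronization item above concrete, here is a minimal sketch of the multi-node job pattern researchers would be supported on: PyTorch DistributedDataParallel with a DistributedSampler. The model, data, and hyperparameters are placeholders, not a recommended configuration.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model and synthetic data; a real job would import both
    # from the researcher's code.
    model = torch.nn.Linear(1024, 10).cuda()
    model = DDP(model, device_ids=[local_rank])  # gradients sync in backward()

    data = TensorDataset(torch.randn(4096, 1024),
                         torch.randint(0, 10, (4096,)))
    sampler = DistributedSampler(data)           # shards data across ranks
    loader = DataLoader(data, batch_size=64, sampler=sampler,
                        num_workers=4, pin_memory=True)

    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.CrossEntropyLoss()
    for epoch in range(2):
        sampler.set_epoch(epoch)                 # reshuffle per epoch
        for x, y in loader:
            x = x.cuda(non_blocking=True)
            y = y.cuda(non_blocking=True)
            opt.zero_grad(set_to_none=True)
            loss_fn(model(x), y).backward()      # all-reduce happens here
            opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

A job like this is typically launched under the scheduler with torchrun, one process per GPU per node. Because DDP averages gradients across ranks during backward(), the global batch size scales with the number of processes, and learning-rate and scaling-strategy adjustments follow from that.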

System Administration and Maintenance

  • Install and configure GPU drivers, CUDA, cuDNN, and ML frameworks
  • Manage system software stack (OS, libraries, compilers, tools)
  • Perform system health checks and preventive maintenance (see the sketch after this list)
  • Handle hardware failures and coordinate with vendors
  • Implement backup and disaster recovery procedures
  • Maintain system security and access controls
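
As a sketch of the health-check item above, the following uses NVIDIA's NVML Python bindings (the pynvml package) to flag overheating GPUs and leaked memory on idle devices. The thresholds are arbitrary examples, not vendor limits.

```python
import pynvml

# Illustrative thresholds; real limits depend on the GPU model and the
# site's operating envelope.
MAX_TEMP_C = 85
MIN_FREE_MEM_FRAC = 0.02

def check_gpus():
    pynvml.nvmlInit()
    problems = []
    for i in range(pynvml.nvmlDeviceGetCount()):
        h = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(h)
        if isinstance(name, bytes):          # older pynvml returns bytes
            name = name.decode()
        temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
        mem = pynvml.nvmlDeviceGetMemoryInfo(h)
        util = pynvml.nvmlDeviceGetUtilizationRates(h)
        if temp > MAX_TEMP_C:
            problems.append(f"GPU{i} ({name}): temperature {temp} C")
        if mem.free / mem.total < MIN_FREE_MEM_FRAC and util.gpu == 0:
            # Memory nearly full on an idle GPU often means a leaked process.
            problems.append(f"GPU{i} ({name}): {mem.free >> 20} MiB free while idle")
    pynvml.nvmlShutdown()
    return problems

if __name__ == "__main__":
    for p in check_gpus():
        print("WARN:", p)
```

In production, a check like this usually runs as the scheduler's node health-check hook (e.g., Slurm's HealthCheckProgram) or is superseded by DCGM, which also covers ECC and NVLink error counters.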

Collaboration and Innovation

  • Work with data architects on storage-compute integration
  • Collaborate with MLOps engineers on training pipeline optimization
  • Assist AIOps engineers with intelligent resource scheduling
  • Evaluate emerging HPC technologies (CXL, GPU Direct Storage, SmartNICs)
  • Participate in architecture reviews and capacity planning
  • Share knowledge through technical presentations and documentation

Academic Qualifications Required

  • Master’s degree in Computer Science, Engineering, or a related field.
  • A PhD is preferred.

Professional Experience Required
Essential:

  • 3+ years of experience in HPC system administration, resource management, or distributed computing.
  • Demonstrated hands-on experience with high-performance computing (HPC), including managing GPU-based clusters for AI/ML workloads.
  • Strong Linux system administration skills (e.g., RHEL, Ubuntu) and ability to operate in large-scale compute environments.
  • Proficiency with HPC workload schedulers such as Slurm (preferred) or PBS, including job configuration, resource management, and troubleshooting.
  • Solid understanding of GPU computing, particularly NVIDIA GPUs, CUDA, and cuDNN, and how to optimize AI/ML workloads for them.
  • Experience with distributed training using frameworks such as PyTorch or TensorFlow, and related communication libraries (e.g., NCCL, Horovod).
  • Scripting and automation proficiency in Python and Bash to support deployment, monitoring, and operational tasks.
  • Experience with high-speed networking concepts and technologies (e.g., InfiniBand, RoCE, RDMA).
  • Familiarity with parallel file systems such as Lustre, BeeGFS, or GPFS to support large-scale AI workloads.
  • Ability to work in a demanding, service-oriented environment with strong organization, communication, and collaboration skills.

Preferred Skills

  • Experience with container orchestration for HPC (e.g., Kubernetes) and/or running AI/ML workloads in containerized environments.
  • Experience with cloud-based HPC solutions (AWS ParallelCluster, Azure CycleCloud, or GCP HPC).
  • Use of performance and profiling tools (e.g., nvprof, nsys, DCGM, Grafana) for workload optimization and system tuning.
  • Exposure to advanced AI training frameworks such as DeepSpeed or Megatron-LM for large-model training.
  • Experience with benchmarking frameworks (MLPerf, HPL, HPCG) for system validation and performance characterization.
  • Knowledge of emerging technologies relevant to AI/HPC (e.g., GPUDirect Storage, CXL, SmartNICs).
  • Contributions to open-source HPC or AI systems projects.
  • Relevant industry certifications (e.g., NVIDIA DGX, CKA) or publications in HPC/AI systems venues (SC, ISC, MLSys).
