Application Open:
Full-Time
MBZUAI is seeking a highly skilled and innovative HPC Engineer with strong hands-on technical experience to join the Robot Learning Laboratory, a state-of-the-art core facility. This role is ideal for candidates with a strong background in HPC and computer engineering and a passion for GPU cluster management, distributed training optimization, and system performance tuning in support of cutting-edge research at scale.
With an emphasis on practical implementation complemented by research-driven innovation, this position focuses on designing, deploying, and optimizing high-performance computing infrastructure for AI/ML workloads and robotics research.
Key Responsibilities
HPC Cluster Architecture and Management
- Design and deploy GPU cluster architecture for large-scale AI/ML training
- Configure and manage HPC schedulers (SLURM/PBS/Kubernetes) for resource allocation
- Implement fair-share policies, GPU quotas, and priority queues
- Design high-speed network topology (InfiniBand/RoCE) for optimal performance
- Plan capacity and resource utilization for growing compute demands (see the reporting sketch after this list)
- Integrate HPC cluster with data storage systems (Lustre/BeeGFS/Ceph)
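For illustration only, a minimal sketch of the kind of utilization reporting this involves, built on the standard Slurm client tools (sinfo/squeue) called from Python; the script itself, its output format, and any partition names are assumptions rather than prescribed tooling:

```python
#!/usr/bin/env python3
"""Minimal sketch: report per-partition node counts and job states on a Slurm
cluster, as a starting point for capacity and utilization tracking.
Assumes the Slurm client tools (sinfo, squeue) are available on PATH."""
import subprocess
from collections import Counter, defaultdict

def run(cmd):
    # Run a Slurm CLI command and return its stdout as non-empty lines.
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return [line for line in out.stdout.splitlines() if line.strip()]

def partition_nodes():
    # %P = partition name, %D = node count (standard sinfo format codes);
    # sinfo emits one line per partition/state group, so counts are summed.
    nodes = defaultdict(int)
    for line in run(["sinfo", "-h", "-o", "%P %D"]):
        part, count = line.split()
        nodes[part.rstrip("*")] += int(count)
    return nodes

def job_states():
    # %P = partition, %T = job state (RUNNING, PENDING, ...).
    states = defaultdict(Counter)
    for line in run(["squeue", "-h", "-o", "%P %T"]):
        part, state = line.split()
        states[part][state] += 1
    return states

if __name__ == "__main__":
    states = job_states()
    for part, count in sorted(partition_nodes().items()):
        s = states.get(part, Counter())
        print(f"{part:<15} nodes={count:<5} running={s['RUNNING']:<5} pending={s['PENDING']}")
```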
Performance Tuning and Optimization
- Benchmark and optimize GPU utilization for deep learning frameworks (PyTorch/TensorFlow)
- Tune distributed training performance (NCCL, Horovod, DeepSpeed, Megatron)
- Optimize data loading pipelines and checkpoint/restore operations
- Profile and analyze system bottlenecks (CPU, GPU, memory, network, I/O)
- Implement performance monitoring and alerting for compute resources
- Optimize MPI and collective communication patterns (a microbenchmark sketch follows this list)
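For illustration only, a minimal collective-communication microbenchmark over torch.distributed with the NCCL backend; the message sizes, iteration counts, and the assumption of a torchrun-style launch (RANK/WORLD_SIZE/LOCAL_RANK set in the environment) are placeholders, not a prescribed methodology:

```python
"""Minimal sketch: time all_reduce over the NCCL backend to spot interconnect
bottlenecks. Assumes launch via torchrun (RANK/WORLD_SIZE/LOCAL_RANK set)."""
import os
import time
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    world = dist.get_world_size()

    for mb in (16, 64, 256, 1024):          # message sizes in MiB (illustrative)
        numel = mb * 1024 * 1024 // 4       # fp32 elements
        x = torch.ones(numel, device="cuda")
        # Warm-up so timings exclude NCCL communicator setup cost.
        for _ in range(5):
            dist.all_reduce(x)
        torch.cuda.synchronize()
        iters = 20
        start = time.perf_counter()
        for _ in range(iters):
            dist.all_reduce(x)
        torch.cuda.synchronize()
        elapsed = (time.perf_counter() - start) / iters
        # Ring all-reduce bus bandwidth estimate: 2 * (N-1)/N * bytes / time.
        bus_gb = 2 * (world - 1) / world * (numel * 4) / elapsed / 1e9
        if dist.get_rank() == 0:
            print(f"{mb:>5} MiB  {elapsed * 1e3:8.2f} ms  ~{bus_gb:6.1f} GB/s bus BW")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

A script of this kind would typically be launched under the scheduler (for example with torchrun across several nodes) and its per-size bandwidth compared against what the InfiniBand/RoCE fabric should deliver.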
Distributed Training Support
- Support researchers with distributed training job configuration
- Troubleshoot multi-node training failures and performance issues
- Implement best practices for gradient synchronization and model parallelism
- Optimize batch sizes, learning rates, and scaling strategies
- Develop tools and scripts for job submission and monitoring (see the submission-helper sketch after this list)
- Create documentation and training materials for HPC users
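For illustration only, a minimal sketch of a submission helper that renders and submits an sbatch script for a multi-node torchrun launch; the defaults (node count, GPUs per node, wall time) and the omission of partition/account settings are placeholder assumptions to adapt to the local cluster:

```python
#!/usr/bin/env python3
"""Minimal sketch of a job-submission helper: render an sbatch script for a
multi-node torchrun launch and submit it. Resource defaults are placeholders."""
import argparse
import subprocess
import tempfile

TEMPLATE = """#!/bin/bash
#SBATCH --job-name={name}
#SBATCH --nodes={nodes}
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:{gpus}
#SBATCH --time={time}
#SBATCH --output={name}-%j.out

# First node in the allocation acts as the rendezvous host.
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun torchrun \\
    --nnodes={nodes} \\
    --nproc_per_node={gpus} \\
    --rdzv_backend=c10d \\
    --rdzv_endpoint="$MASTER_ADDR:29500" \\
    {script}
"""

def submit(args):
    body = TEMPLATE.format(name=args.name, nodes=args.nodes, gpus=args.gpus,
                           time=args.time, script=args.script)
    with tempfile.NamedTemporaryFile("w", suffix=".sbatch", delete=False) as f:
        f.write(body)
        path = f.name
    # sbatch prints e.g. "Submitted batch job <id>" on success.
    result = subprocess.run(["sbatch", path], capture_output=True, text=True, check=True)
    print(result.stdout.strip())

if __name__ == "__main__":
    p = argparse.ArgumentParser()
    p.add_argument("script", help="training script to launch via torchrun")
    p.add_argument("--name", default="train")
    p.add_argument("--nodes", type=int, default=2)
    p.add_argument("--gpus", type=int, default=8)
    p.add_argument("--time", default="04:00:00")
    submit(p.parse_args())
```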
System Administration and Maintenance
- Install and configure GPU drivers, CUDA, cuDNN, and ML frameworks
- Manage system software stack (OS, libraries, compilers, tools)
- Perform system health checks and preventive maintenance (see the health-check sketch after this list)
- Handle hardware failures and coordinate with vendors
- Implement backup and disaster recovery procedures
- Maintain system security and access controls
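For illustration only, a minimal node-level health-check sketch built on nvidia-smi query fields; the thresholds and the check selection are placeholder assumptions to tune per node type, not a prescribed monitoring design:

```python
#!/usr/bin/env python3
"""Minimal sketch of a node-level GPU health check using nvidia-smi query
fields. Thresholds are illustrative and should be tuned per hardware."""
import subprocess
import sys

TEMP_LIMIT_C = 85            # illustrative alert threshold
MEM_FRACTION_LIMIT = 0.95    # illustrative alert threshold

def gpu_stats():
    # Each output line: index, temperature (C), GPU util (%), mem used/total (MiB).
    fields = "index,temperature.gpu,utilization.gpu,memory.used,memory.total"
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={fields}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True)
    for line in out.stdout.strip().splitlines():
        idx, temp, util, used, total = [v.strip() for v in line.split(",")]
        yield int(idx), int(temp), int(util), int(used), int(total)

def main():
    problems = []
    for idx, temp, util, used, total in gpu_stats():
        if temp >= TEMP_LIMIT_C:
            problems.append(f"GPU {idx}: temperature {temp} C")
        if used / total >= MEM_FRACTION_LIMIT:
            problems.append(f"GPU {idx}: memory {used}/{total} MiB")
    if problems:
        print("\n".join(problems))
        sys.exit(1)   # non-zero exit so cron or a monitoring agent can alert
    print("all GPUs healthy")

if __name__ == "__main__":
    main()
```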
Collaboration and Innovation
- Work with data architects on storage-compute integration
- Collaborate with MLOps engineers on training pipeline optimization
- Assist AIOps engineers with intelligent resource scheduling
- Evaluate emerging HPC technologies (CXL, GPUDirect Storage, SmartNICs)
- Participate in architecture reviews and capacity planning
- Share knowledge through technical presentations and documentation
Academic Qualifications Required
- Master’s degree in Computer Science, Engineering, or a related field.
- A PhD is preferred.
Professional Experience Required
Essential:
- 3+ years of experience in HPC system administration, resource management, or distributed computing.
- Demonstrated hands-on experience with high-performance computing (HPC), including managing GPU-based clusters for AI/ML workloads.
- Strong Linux system administration skills (e.g., RHEL, Ubuntu) and ability to operate in large-scale compute environments.
- Proficiency with HPC workload schedulers such as Slurm (preferred) or PBS, including job configuration, resource management, and troubleshooting.
- Solid understanding of GPU computing, particularly NVIDIA GPUs, CUDA, and cuDNN, and how to optimize AI/ML workloads for them.
- Experience with distributed training using frameworks such as PyTorch or TensorFlow, and related communication libraries (e.g., NCCL, Horovod); a minimal sketch follows this list.
- Scripting and automation proficiency in Python and Bash to support deployment, monitoring, and operational tasks.
- Experience with high-speed networking concepts and technologies (e.g., InfiniBand, RoCE, RDMA).
- Familiarity with parallel file systems such as Lustre, BeeGFS, or GPFS to support large-scale AI workloads.
- Ability to work in a demanding, service-oriented environment with strong organization, communication, and collaboration skills.
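For illustration only, a minimal DistributedDataParallel skeleton of the kind a candidate would be expected to read, debug, and tune; the model, data, and hyperparameters are synthetic placeholders, and a torchrun-style launch (RANK/WORLD_SIZE/LOCAL_RANK set) is assumed:

```python
"""Minimal sketch of a DistributedDataParallel training loop over the NCCL
backend; model, data, and hyperparameters are synthetic placeholders."""
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model and synthetic data; swap in the real workload.
    model = DDP(torch.nn.Linear(1024, 10).cuda(), device_ids=[local_rank])
    data = TensorDataset(torch.randn(4096, 1024), torch.randint(0, 10, (4096,)))
    sampler = DistributedSampler(data)                 # shards data across ranks
    loader = DataLoader(data, batch_size=64, sampler=sampler,
                        num_workers=4, pin_memory=True)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)                       # reshuffle shards per epoch
        for x, y in loader:
            x, y = x.cuda(non_blocking=True), y.cuda(non_blocking=True)
            opt.zero_grad()
            loss_fn(model(x), y).backward()            # DDP syncs gradients here
            opt.step()
        if dist.get_rank() == 0:
            print(f"epoch {epoch} done")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```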
Preferred Skills
- Experience with container orchestration for HPC (e.g., Kubernetes) and/or running AI/ML workloads in containerized environments.
- Experience with cloud-based HPC solutions (AWS ParallelCluster, Azure CycleCloud, or GCP HPC).
- Use of performance and profiling tools (e.g., nvprof, nsys, DCGM, Grafana) for workload optimization and system tuning.
- Exposure to advanced AI training frameworks such as DeepSpeed or Megatron-LM for large-model training.
- Experience with benchmarking frameworks (MLPerf, HPL, HPCG) for system validation and performance characterization.
- Knowledge of emerging technologies relevant to AI/HPC (e.g., GPUDirect Storage, CXL, SmartNICs).
- Contributions to open-source HPC or AI systems projects.
- Relevant industry certifications (e.g., NVIDIA DGX, CKA) or publications in HPC/AI systems venues (SC, ISC, MLSys).