Full-Time
MBZUAI’s Institute of Foundation Models (IFM) is seeking a Senior HPC Engineer to provide technical leadership in designing, operating, and evolving the large-scale GPU infrastructure that supports frontier AI research. IFM operates one of the world’s largest AI-focused supercomputing environments, and this role contributes directly to its research and development.
Key Responsibilities
• Lead operation and optimization of large-scale GPU clusters.
• Drive reliability, scalability, and performance improvements.
• Lead troubleshooting and root cause analysis of complex issues.
• Design and validate new cluster deployments and upgrades.
• Collaborate with researchers to optimize distributed AI training.
• Lead vendor engagement and technical reviews.
• Mentor junior engineers.
• Define monitoring, operational standards, and capacity planning processes.
• Participate in major incident management and escalations.
Academic Qualification
Professional Experience Required
Essential:
• 5+ years in HPC, Linux infrastructure, cloud infrastructure, distributed systems, or large-scale production environments.
• Experience with Slurm and Linux administration.
• Experience troubleshooting compute, storage, and networking systems.
Preferred:
• GPU cluster operations.
• NVIDIA technologies including CUDA, NCCL, NVLink, and GPUDirect.
• InfiniBand networking.
• Weka, Lustre, BeeGFS, or similar storage platforms.
• Azure, AWS, or GCP.
• Terraform, Ansible, or other Infrastructure-as-Code tooling.
• PyTorch Distributed, Megatron-LM, DeepSpeed, FSDP, or other large-scale AI training environments.