Full-Time
MBZUAI is seeking a highly skilled HPC Network Engineer to design, implement, and operate the high-performance networking infrastructure that underpins the university’s research computing environment.
This role is critical to providing reliable, low-latency, high-bandwidth connectivity across GPU and CPU clusters, parallel storage systems, and research platforms that support large-scale AI/ML and robotics workloads. The position focuses on network architecture, optimization, monitoring, and troubleshooting for HPC environments, so that researchers can operate at scale while performance, resilience, security, and compliance are maintained across all HPC facilities.
Key Responsibilities:
HPC Network Architecture & Engineering
- Design, deploy, and maintain high-performance network architectures for HPC clusters, GPU servers, CPU nodes, and parallel storage systems.
- Configure and optimize high-speed interconnects, including InfiniBand, RoCE, and high-speed Ethernet (25/100/200GbE+), to support low-latency and high-throughput workloads.
- Design network topologies optimized for MPI traffic, NCCL collectives, and large-scale data transfers.
- Integrate networking solutions with parallel file systems such as Lustre, BeeGFS, or GPFS.
Network Operations, Monitoring & Troubleshooting
- Monitor network performance, capacity, and availability across all HPC facilities.
- Diagnose and resolve complex network issues affecting compute, storage, and distributed training workloads.
- Implement performance monitoring, alerting, and diagnostics using HPC-specific networking tools.
- Maximize uptime and performance of research computing resources.
Security, Compliance & Reliability
- Implement and maintain network security controls aligned with data center and institutional standards.
- Ensure compliance with internal policies, safety requirements, and regulatory obligations.
- Develop preventive maintenance procedures and support disaster recovery and resilience planning for network infrastructure.
Upgrades, Capacity Planning & Innovation
- Plan and execute network upgrades, expansions, and technology refreshes with minimal disruption to research activities.
- Support capacity planning and forecasting for growing AI/HPC workloads.
- Evaluate emerging networking technologies relevant to AI and HPC (e.g., SmartNICs, CXL, GPUDirect RDMA).
Documentation & Collaboration
- Develop and maintain detailed network documentation, architecture diagrams, configuration records, and operational procedures.
- Collaborate with HPC system engineers, storage architects, MLOps, and research teams to ensure end-to-end system performance.
- Provide expert-level support and guidance on network-related issues to internal stakeholders.
Professional Experience Required:
Essential:
- Minimum 5 years of experience in network engineering, with at least 3 years in HPC or research computing environments.
- Extensive hands-on experience with high-performance networking technologies such as InfiniBand, Omni-Path, RoCE, or high-speed Ethernet.
- Proven expertise in configuring and troubleshooting network infrastructure for parallel file systems (e.g., Lustre, GPFS, BeeGFS).
- Strong understanding of data-center networking concepts, including routing, switching, VLANs, RDMA, and network security.
- Experience designing networks optimized for MPI workloads and large-scale distributed AI training.
- Proficiency with network monitoring and diagnostic tools in HPC environments.
- Ability to work in a demanding, service-oriented environment with strong organization, communication, and collaboration skills.
Preferred:
- Experience with software-defined networking (SDN) in HPC contexts.
- Professional certifications such as CCNP, CCIE, or equivalent.
- Experience supporting HPC environments in academic or research institutions.
- Exposure to GPU-centric networking architectures and NVIDIA networking technologies.