Application: Open
Employment Type: Full-Time
Job Purpose:
MBZUAI is seeking a highly skilled High-Performance Computing (HPC) System Engineer to oversee and optimize its HPC infrastructure. This role is responsible for administering HPC clusters, managing job scheduling, optimizing resource allocation, ensuring system security, and fine-tuning hardware performance. The ideal candidate will have expertise in Slurm, Kubernetes, Linux system administration, and automation tools to maintain seamless operations and ensure high availability of HPC resources.
Key Responsibilities:
HPC Infrastructure Deployment & Maintenance:
- Install, configure, and maintain HPC clusters, high-speed storage systems, and networking.
- Manage compute nodes, GPUs, high-speed interconnects, and parallel file systems.
- Implement monitoring tools (Prometheus, Grafana, Nagios) to track system performance (an illustrative exporter sketch follows this list).
- Optimize cluster scalability and performance for growing research demands.
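For illustration only, here is a minimal sketch of the monitoring glue this work typically involves: a small Prometheus exporter that publishes per-GPU metrics from a compute node. It assumes the prometheus_client Python library and nvidia-smi are available on the node; the metric names and port are hypothetical, not details taken from this posting.

```python
#!/usr/bin/env python3
"""Minimal per-GPU metrics exporter sketch (illustrative assumptions throughout)."""
import subprocess
import time

from prometheus_client import Gauge, start_http_server

# Hypothetical metric names; a real deployment would follow site naming conventions.
GPU_UTIL = Gauge("node_gpu_utilization_percent", "GPU utilization (%)", ["gpu"])
GPU_MEM = Gauge("node_gpu_memory_used_mib", "GPU memory used (MiB)", ["gpu"])

def collect() -> None:
    """Query nvidia-smi once and update the gauges for every GPU on the node."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.strip().splitlines():
        idx, util, mem = [field.strip() for field in line.split(",")]
        GPU_UTIL.labels(gpu=idx).set(float(util))
        GPU_MEM.labels(gpu=idx).set(float(mem))

if __name__ == "__main__":
    start_http_server(9101)  # assumed exporter port; Prometheus scrapes this endpoint
    while True:
        collect()
        time.sleep(15)  # collection interval chosen to suit a typical scrape interval
```

Grafana dashboards and alert rules would then be built on whatever metrics Prometheus scrapes from exporters like this one.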
Resource & Job Scheduling Optimization:
- Administer and fine-tune the Slurm workload manager to optimize job scheduling and resource allocation.
- Implement Kubernetes for containerized HPC workloads and scalable orchestration.
- Develop automated policies for fair resource distribution and efficiency (see the reporting sketch after this list).
- Work with researchers to optimize their computational workloads.
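As a sketch of the fair-use reporting such policies might be built on, the script below counts pending jobs per Slurm account by parsing squeue output; account-based scheduling and the report format are assumptions for illustration, not details from this posting.

```python
#!/usr/bin/env python3
"""Report pending Slurm jobs per account (illustrative sketch)."""
import subprocess
from collections import Counter

def pending_jobs_by_account() -> Counter:
    """Count PENDING jobs per account using squeue's parsable output (%a = account)."""
    out = subprocess.run(
        ["squeue", "--noheader", "--states=PENDING", "--format=%a"],
        capture_output=True, text=True, check=True,
    ).stdout
    return Counter(line.strip() for line in out.splitlines() if line.strip())

if __name__ == "__main__":
    for account, count in pending_jobs_by_account().most_common():
        print(f"{account}: {count} pending jobs")
```

A report like this is a starting point for fairness decisions (for example, adjusting Slurm fair-share weights or partition limits), not a policy in itself.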
System Performance & Security:
- Conduct system performance tuning, kernel optimizations, and parallel computing enhancements.
- Apply Linux security best practices, access control measures, and compliance policies.
- Develop automated scripts and tools for system diagnostics and failure recovery.
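A minimal example of the kind of diagnostics-and-recovery script this refers to: it drains the local Slurm node when an expected parallel-filesystem mount disappears. The /lustre mount point, the use of scontrol, and the required operator privileges are assumptions made for illustration.

```python
#!/usr/bin/env python3
"""Drain the local Slurm node if its parallel-filesystem mount is missing (sketch)."""
import os
import socket
import subprocess

MOUNT_POINT = "/lustre"  # hypothetical mount point; substitute the site's actual path

def mount_is_healthy(path: str) -> bool:
    """A mount that has disappeared generally stops being a mount point."""
    return os.path.ismount(path)

def drain_self(reason: str) -> None:
    """Mark this node DRAIN so Slurm stops scheduling new jobs onto it."""
    node = socket.gethostname().split(".")[0]
    subprocess.run(
        ["scontrol", "update", f"NodeName={node}", "State=DRAIN", f"Reason={reason}"],
        check=True,
    )

if __name__ == "__main__":
    if not mount_is_healthy(MOUNT_POINT):
        drain_self(f"{MOUNT_POINT} not mounted")
```

In practice a check like this would run from a node health-check hook (for example, Slurm's HealthCheckProgram) or a cron job rather than by hand.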
User Support & Documentation:
- Provide technical support for HPC users, assisting with job submissions, debugging, and resource optimization.
- Develop and maintain comprehensive HPC documentation, FAQs, and user guides.
- Conduct training sessions and workshops on best practices for HPC utilization.
Collaboration & Future Planning:
- Work closely with research teams, IT, and vendors to enhance HPC capabilities.
- Stay updated with emerging HPC technologies, AI/ML acceleration, and distributed computing advancements.
- Assist in long-term infrastructure planning, hardware refreshes, and performance benchmarking.
Other Duties:
- Perform all other duties as reasonably directed by the line manager that are commensurate with these functional objectives.
Academic Qualifications:
- Bachelor’s degree in Computer Science, Engineering, or a related field.
- A postgraduate degree is preferred.
Professional Experience:
Essential
- 3+ years of experience in HPC system administration, resource management, or distributed computing.
- Proficiency in Linux system administration (Ubuntu, CentOS, RHEL).
- Experience with Slurm, Kubernetes, and Docker/Singularity for job scheduling and container orchestration.
- Hands-on experience with automation tools (Ansible, Terraform, Puppet).
- Experience with high-speed storage solutions (Lustre, ZFS, Ceph) and networking (InfiniBand, RDMA).
- Strong problem-solving skills in debugging, profiling, and optimizing computational applications.
- Excellent English communication skills, a collaborative attitude, and the ability to work effectively with engineers at all levels.
- Experience with source control systems, build tools, and continuous integration pipelines.
- Hardworking, self-motivated, and detail-oriented, with a proven ability to meet tight deadlines.
Preferred
- A PhD, with 2+ years of equivalent practical or research experience, is preferred.
- Background in scientific research or domain-specific computing.
- Contributions to open-source scientific computing projects.
- Experience in higher education or research institutions, with an understanding of core research facility operations.
- Proficiency in data analytics for process optimization and continuous improvement.
- Working proficiency in additional languages is a plus.