Senior High Performance Computing Engineer

Vacancy Overview

Full-Time

Job Purpose:

As a Senior HPC Engineer at MBZUAI, the individual will be the driving force behind the university’s computational transformation, empowering world-class researchers to push the boundaries of deep learning and artificial intelligence. Leveraging expertise in GPU cluster management and distributed computing, the engineer will architect and optimize a state-of-the-art HPC infrastructure that enables the efficient training of large-scale neural networks.

By seamlessly integrating cutting-edge parallel processing capabilities into the research workflows, the engineer will accelerate the development of groundbreaking AI solutions that have the potential to reshape entire industries. Strategic and innovative problem-solving skills will be instrumental in establishing MBZUAI as a global hub for high-performance computing in deep learning, driving impactful discoveries that inspire the next generation of AI pioneers.

Key Responsibilities:

  1. Cloud & On-Prem HPC Infrastructure Management and Optimization
    • Design, deploy, and maintain state-of-the-art GPU clusters to support the computational demands of MBZUAI’s research initiatives.
    • Continuously monitor and optimize the performance, reliability, and scalability of the HPC infrastructure to ensure maximum uptime and efficiency.
    • Implement advanced resource management strategies to efficiently allocate computing resources based on project priorities and emerging research needs.
    • Ensure the HPC environment adheres to the highest standards of security, compliance, and energy efficiency to safeguard sensitive data and minimize environmental impact.
    • Develop and oversee comprehensive system monitoring and alerting mechanisms to proactively identify and resolve issues before they impact research workflows (a minimal GPU monitoring sketch follows this list).
  2. Distributed and Parallel Deep Learning Capabilities
    • Spearhead the implementation of distributed computing techniques to enable parallel training of large-scale deep learning models across multiple GPUs and nodes (a data-parallel training sketch follows this list).
    • Collaborate closely with data scientists and machine learning engineers to seamlessly integrate distributed training capabilities into existing deep learning frameworks (e.g., TensorFlow, PyTorch, MXNet).
    • Develop and optimize data distribution and synchronization strategies to achieve faster model convergence and reduced training times, empowering researchers to explore more ambitious AI projects.
    • Continuously research and implement the latest advancements in distributed deep learning to maintain MBZUAI’s competitive edge and ensure the institution remains at the forefront of cutting-edge AI research.
    • Provide comprehensive training and support to research teams on the effective utilization of distributed deep-learning workflows.
  3. Performance Engineering and Profiling
    • Utilize advanced profiling tools and techniques to identify and resolve performance bottlenecks within the HPC environment, ensuring researchers can maximize the productivity of their computational resources (a profiling sketch follows this list).
    • Fine-tune GPU clusters and deep learning frameworks to deliver optimal performance for specific research workloads and use cases, driving breakthroughs in areas such as computer vision, natural language processing, and scientific computing.
    • Implement innovative strategies to maximize the utilization and efficiency of MBZUAI’s computing resources, including techniques like dynamic resource allocation and workload prioritization.
    • Develop and maintain a comprehensive knowledge base of performance optimization best practices, promoting cross-functional collaboration and knowledge sharing among HPC, data science, and machine learning teams.
  4. Thought Leadership and Collaboration
    • Proactively engage with the global HPC and deep learning research communities to stay abreast of the latest trends and innovations and identify opportunities for MBZUAI to contribute to and influence the direction of the field.
    • Contribute to the development of MBZUAI’s strategic roadmap for HPC and deep learning infrastructure, ensuring the institution maintains a cutting-edge technological advantage and supports the evolving needs of its research programs.
    • Mentor and train junior HPC engineers, fostering a culture of continuous learning and knowledge sharing to build a sustainable and highly skilled HPC team.
    • Represent MBZUAI at industry conferences and events, showcasing the institution’s cutting-edge HPC and deep learning capabilities and establishing MBZUAI as a global leader in AI research and innovation.
  5. Technical Expertise and Innovation
    • Demonstrate proficiency in at least one public cloud HPC service, preferably AWS SageMaker HyperPod.
    • Apply core DevOps and infrastructure-as-code tooling such as Terraform, CloudFormation, and Ansible.
    • Draw on solid system administration skills covering VMs, VPCs, object and block storage, VPNs, firewalls, IAM, and related services.
    • Maintain in-depth knowledge of the latest advancements in GPU hardware, deep learning frameworks, and parallel computing technologies.
    • Explore and evaluate emerging HPC and deep learning trends, and champion the adoption of innovative solutions that can drive MBZUAI’s research agenda forward.
    • Collaborate with cross-functional teams to develop novel approaches and methodologies for optimizing the performance and utilization of MBZUAI’s HPC resources.
    • Contribute to the development of intellectual property and thought leadership in the field of high-performance computing for deep learning.
  6. Other Duties
    • Perform all other duties as reasonably directed by the line manager that are commensurate with these functional objectives.
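
The sketch below illustrates the kind of GPU utilization polling behind the monitoring and alerting duties in responsibility 1, using the NVIDIA Management Library Python bindings (pynvml). It is a minimal sketch: the poll count, interval, and printed fields are illustrative assumptions, not MBZUAI's actual tooling.

```python
# Minimal GPU utilization poll via pynvml (the nvidia-ml-py package).
import time
import pynvml

pynvml.nvmlInit()
try:
    gpu_count = pynvml.nvmlDeviceGetCount()
    for _ in range(3):                      # poll a few times for demonstration
        for i in range(gpu_count):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            print(f"GPU {i}: {util.gpu}% busy, "
                  f"{mem.used / mem.total:.0%} memory in use")
        time.sleep(5)                       # illustrative polling interval
finally:
    pynvml.nvmlShutdown()
```

In practice, readings like these would feed an alerting stack (for example Prometheus and Grafana) rather than being printed to the console.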
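
The distributed training work in responsibility 2 centers on patterns like the one sketched below: a minimal PyTorch DistributedDataParallel loop intended for launch with torchrun. The toy linear model, synthetic batches, and optimizer settings are placeholder assumptions used only to show how gradient synchronization layers onto an ordinary training loop.

```python
# Minimal data-parallel training sketch with PyTorch DDP; launch with torchrun.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, and MASTER_ADDR/PORT for us.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)   # placeholder model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):                                     # synthetic steps
        batch = torch.randn(32, 1024, device=f"cuda:{local_rank}")
        loss = model(batch).sum()
        optimizer.zero_grad()
        loss.backward()        # DDP all-reduces gradients across ranks here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with, for example, torchrun --nnodes=2 --nproc_per_node=8 train_ddp.py on each node, the NCCL backend performs the gradient all-reduce that keeps every model replica in sync.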
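
For the performance engineering duties in responsibility 3, the following minimal sketch uses PyTorch's built-in torch.profiler to surface the operators that dominate GPU time in a single forward/backward pass; the model and input shapes are placeholders chosen only for illustration.

```python
# Minimal profiling sketch with torch.profiler over one training step.
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder model and input purely for illustration.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 10)
).cuda()
batch = torch.randn(64, 1024, device="cuda")

# Record CPU and CUDA activity for one forward/backward pass.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True) as prof:
    loss = model(batch).sum()
    loss.backward()

# Show the operators that consume the most GPU time.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```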

Academic Qualification:

  • Bachelor’s degree in Computer Science, Electrical Engineering, or a related technical field, with a focus on High-Performance Computing, Parallel Processing, Distributed Systems, or Deep Learning.

Professional Experience:

  • 3+ years of proven experience in the design, deployment, and management of GPU clusters, including installation, configuration, monitoring, and performance optimization.
  • Proven expertise in implementing distributed computing techniques and parallel training strategies for deep learning models.
  • Strong proficiency in popular deep learning frameworks (e.g., TensorFlow, PyTorch, MXNet) and GPU-accelerated programming (e.g., CUDA, cuDNN).
  • Hands-on experience with performance profiling and optimization tools for HPC and deep learning.
  • Knowledge of resource management and scheduling systems (e.g., SLURM, Kubernetes).
  • Excellent problem-solving and troubleshooting skills to address complex technical challenges.
  • Effective communication and collaboration skills to work with cross-functional teams.

Apply Now:
