GPU Cluster System Engineer


Vacancy Overview

Application Open:

Full-Time

Job Purpose:

The GPU Cluster System Engineer will work at the forefront of GPU cluster performance optimization and help keep the cluster healthy. The ideal candidate will have a strong background in GPU architecture and parallel computing, along with hands-on experience in system-level performance tuning, debugging methodologies, and data center management.

Key Responsibilities:

Performance Optimization: Collaborate with hardware and software teams to enhance the overall performance of GPU clusters, focusing on aspects such as throughput, latency, and GPU utilization.
Benchmarking and Analysis: Develop and execute comprehensive benchmarking strategies to assess baseline performance, analyze bottlenecks, and identify areas for improvement within GPU cluster environments.
Cluster Stability: Evaluate the stability of GPU clusters by conducting thorough testing under various workloads, ensuring optimal performance across the cluster at large scale.
Lead and manage the growth of our open-source community around projects such as LLM360.
Performance Profiling: Utilize profiling tools and methodologies to analyze and identify performance bottlenecks, providing actionable insights for improvement.
Performance Tuning: Implement optimization strategies, including but not limited to protocol enhancements, load balancing techniques, and parallel processing optimizations.
Documentation: Create detailed documentation of performance analysis, tuning efforts, and outcomes, providing clear and concise reports for internal teams and stakeholders.
Collaboration: Work closely with cross-functional teams, including hardware engineers, software developers, and system architects, to integrate performance improvements into the GPU cluster architecture.
Continuous Learning: Stay current with the latest developments in GPU architectures, parallel processing, and emerging technologies to drive continuous improvement in GPU cluster performance.

Academic Qualifications:

Bachelor’s degree or higher in a related technical field (e.g., computer science or high-performance computing), or equivalent professional experience.

Professional Experience:

Proven experience in optimizing the performance of GPU clusters.
Strong understanding of GPU architectures, parallel computing concepts, and network protocols.
Proficiency in scripting languages (e.g., Python, Bash) for automation and performance analysis.
Experience with system-level performance analysis tools and methodologies for GPU clusters.
Analytical mindset with excellent problem-solving and debugging skills.
Familiarity with cluster management tools and systems.
Excellent communication and collaboration skills for effective teamwork.
Experience with RDMA network configuration, troubleshooting, and performance tuning.
Expertise in Linux kernel networking.
Experience with machine learning and/or HPC system design.
