GPU Cluster System Engineer

Vacancy Overview

Application Open:

Full-Time

Job Purpose:

The GPU Cluster System Engineer will be at the forefront of optimizing GPU cluster performance and will help keep the cluster healthy. The ideal candidate will have a strong background in GPU architecture and parallel computing, along with hands-on experience in system-level performance tuning, debugging methodologies, and data center management.

Key Responsibilities:

  • Performance Optimization: Collaborate with hardware and software teams to enhance the overall performance of GPU clusters, focusing on aspects such as throughput, latency, and GPU utilization.
  • Benchmarking and Analysis: Develop and execute comprehensive benchmarking strategies to assess baseline performance, analyze bottlenecks, and identify areas for improvement within GPU cluster environments.
  • Cluster Stability: Evaluate the stability of GPU clusters by conducting thorough testing under various workloads, ensuring optimal performance across the cluster at large scale.
  • Community Leadership: Lead and manage the growth of our open-source community around projects such as LLM360.
  • Performance Profiling: Utilize profiling tools and methodologies to analyze and identify performance bottlenecks, providing actionable insights for improvement.
  • Performance Tuning: Implement optimization strategies, including but not limited to protocol enhancements, load balancing techniques, and parallel processing optimizations.
  • Documentation: Create detailed documentation of performance analysis, tuning efforts, and outcomes, providing clear and concise reports for internal teams and stakeholders.
  • Collaboration: Work closely with cross-functional teams, including hardware engineers, software developers, and system architects, to integrate performance improvements into the GPU cluster architecture.
  • Continuous Learning: Stay current with the latest developments in GPU architectures, parallel processing, and emerging technologies to drive continuous improvement in GPU cluster performance.

Academic Qualifications:

  • Bachelor’s degree or higher in a related technical field (e.g., computer science, high-performance computing), or equivalent professional experience.

Professional Experience:

  • Proven experience in optimizing the performance of GPU clusters.
  • Strong understanding of GPU architectures, parallel computing concepts, and network protocols.
  • Proficiency in scripting languages (e.g., Python, Bash) for automation and performance analysis.
  • Experience with system-level performance analysis tools and methodologies for GPU clusters.
  • Analytical mindset with excellent problem-solving and debugging skills.
  • Familiarity with cluster management tools and systems.
  • Excellent communication and collaboration skills for effective teamwork.
  • Experience with RDMA network configuration, troubleshooting, and performance tuning.
  • Expertise in Linux kernel networking.
  • Experience with machine learning and/or HPC system design.
