Application Open:
Full-Time
Job Purpose:
The GPU Cluster System Engineer will be at the forefront of optimizing GPU cluster performance and will help keep the cluster healthy. The ideal candidate will have a strong background in GPU architecture and parallel computing, along with hands-on experience in system-level performance tuning, debugging methodologies, and data center management.
Key Responsibilities:
- Performance Optimization: Collaborate with hardware and software teams to enhance the overall performance of GPU clusters, focusing on aspects such as throughput, latency, and GPU utilization.
- Benchmarking and Analysis: Develop and execute comprehensive benchmarking strategies to assess baseline performance, analyze bottlenecks, and identify areas for improvement within GPU cluster environments.
- Cluster Stability: Evaluate the stability of GPU clusters by conducting thorough testing under various workloads, ensuring optimal performance across the entire cluster at large scale.
- Community: Lead and manage the growth of our open-source community around projects such as LLM360.
- Performance Profiling: Utilize profiling tools and methodologies to analyze and identify performance bottlenecks, providing actionable insights for improvement.
- Performance Tuning: Implement optimization strategies, including but not limited to protocol enhancements, load balancing techniques, and parallel processing optimizations.
- Documentation: Create detailed documentation of performance analysis, tuning efforts, and outcomes, providing clear and concise reports for internal teams and stakeholders.
- Collaboration: Work closely with cross-functional teams, including hardware engineers, software developers, and system architects, to integrate performance improvements into the GPU cluster architecture.
- Continuous Learning: Stay current with the latest developments in GPU architectures, parallel processing, and emerging technologies to drive continuous improvement in GPU cluster performance.
Academic Qualifications:
- Bachelor’s degree or higher in a related technical field (e.g., computer science, high-performance computing), or equivalent professional experience.
Professional Experience:
- Proven experience in optimizing the performance of GPU clusters.
- Strong understanding of GPU architectures, parallel computing concepts, and network protocols.
- Proficiency in scripting languages (e.g., Python, Bash) for automation and performance analysis.
- Experience with system-level performance analysis tools and methodologies for GPU clusters.
- Analytical mindset with excellent problem-solving and debugging skills.
- Familiarity with cluster management tools and systems.
- Excellent communication and collaboration skills for effective teamwork.
- RDMA network configuration, troubleshooting and performance tuning.
- Linux kernel networking expertise.
- Machine learning and/or HPC system design.