Application Open:
Full-Time
Job Purpose:
The Distributed ML Engineer will work at the forefront of performance optimization for machine learning software stacks, especially training and inference, and will support the team in developing new, cutting-edge systems. The ideal candidate will have a strong background in parallel computing, hands-on experience in system-level coding and debugging methodologies, and experience with large-scale machine learning.
Key Responsibilities:
- Understand, analyze, profile, and optimize deep learning workloads on state-of-the-art hardware and software platforms, and provide guidance to the team on improving their efficiency at different levels of optimization.
- Design and implement performance benchmarks and testing methodologies to evaluate application performance.
- Build tools to automate workload analysis, workload optimization, and other critical workflows.
- Triage system issues and identify bottlenecks and inefficiencies by analyzing their sources and their impact on hardware and network, and propose solutions to enhance GPU utilization.
- Support the team to develop appropriate kernels and systems for new model architectures and algorithms.
- Participate in or lead design reviews with peers and stakeholders to decide among available technologies.
- Review code developed by other developers and provide feedback to ensure best practices (e.g., style guidelines, checking code in, accuracy, testability, and efficiency).
- Contribute to existing documentation or educational content and adapt content based on product/program updates and user feedback.
- Represent MBZUAI at industry conferences and events, showcasing the institution’s cutting-edge HPC and deep learning capabilities and establishing MBZUAI as a global leader in AI research and innovation.
- Perform all other duties as reasonably directed by the line manager that are commensurate with these functional objectives.
Academic Qualifications:
- Ph.D. in CS, EE, or CSEE with 1+ years of working experience, or a Master's in CS, EE, or CSEE (or equivalent) with 2+ years of working experience.
Minimum Professional Experience:
- Background in deep learning model architectures and experience with PyTorch and large-scale distributed training.
- Proficiency in Python and C/C++ for analyzing and optimizing code.
- Excellent problem-solving and troubleshooting skills to address complex technical challenges.
- Effective communication and collaboration skills to work with cross-functional teams.
- Experience using multi-node GPU infrastructure.
Preferred Professional Experience:
- Expertise in designing, analyzing, and troubleshooting large-scale distributed systems.
- Deep understanding of GPU, CPU, or other AI accelerator architectures.
- Experience writing and optimizing compute kernels in CUDA, Triton, or similar languages.
- Familiarity with LLM architectures and training infrastructure.
- Experience driving ML accuracy with low-precision formats.
- 3+ years of relevant industry experience.
- Experience in performance optimization of large-scale distributed systems.
- Systematic problem-solving approach, coupled with effective verbal and written communication skills.