Distributed ML Engineer


Vacancy Overview

Application Open:

Full-Time

Job Purpose:

The Distributed ML Engineer will work at the forefront of performance optimization for machine learning software stacks, with a focus on training and inference, and will support the team in developing new, cutting-edge systems. The ideal candidate will have a strong background in parallel computing and hands-on experience with system-level coding, debugging methodologies, and large-scale machine learning.

Key Responsibilities:

  • Understand, analyze, profile, and optimize deep learning workloads on state-of-the-art hardware and software platforms, and guide the team in improving their efficiency at multiple levels of optimization.
  • Design and implement performance benchmarks and testing methodologies to evaluate application performance.
  • Build tools to automate workload analysis, workload optimization, and other critical workflows.
  • Triage system issues, identify bottlenecks and inefficiencies by analyzing their root causes and their impact on hardware and the network, and propose solutions to improve GPU utilization.
  • Support the team to develop appropriate kernels and systems for new model architectures and algorithms.
  • Participate in, or lead, design reviews with peers and stakeholders to choose among available technologies.
  • Review code developed by other developers and provide feedback to ensure best practices (e.g., style guidelines, checking code in, accuracy, testability, and efficiency).
  • Contribute to existing documentation or educational content and adapt content based on product/program updates and user feedback.
  • Represent MBZUAI at industry conferences and events, showcasing the institution’s cutting-edge HPC and deep learning capabilities and establishing MBZUAI as a global leader in AI research and innovation.
  • Perform all other duties as reasonably directed by the line manager that are commensurate with these functional objectives.
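To illustrate the benchmarking responsibility above, the following is a minimal, purely illustrative sketch of a timing harness of the kind a candidate might build (the function names and parameters here are hypothetical, not part of the role description): it discards warmup runs and reports a median latency, which is more robust to outliers than a single measurement.

```python
import statistics
import time

def benchmark(fn, *, warmup=3, iters=10):
    """Time a callable: discard warmup runs, return the median latency in seconds."""
    for _ in range(warmup):
        fn()  # warm caches / lazy initialization before measuring
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

# Example: measure a trivial CPU-bound workload.
elapsed = benchmark(lambda: sum(range(100_000)))
print(f"median latency: {elapsed * 1e3:.3f} ms")
```

In practice a GPU workload would also need device synchronization around each timed region; this sketch only shows the warmup-and-median structure.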

Academic Qualifications:

  • Ph.D. in CS, EE, or CSEE with 1+ years of work experience, or a Master's in CS, EE, or CSEE (or equivalent) with 2+ years of work experience.

Minimum Professional Experience:

  • Background in deep learning model architectures and experience with PyTorch and large-scale distributed training.
  • Proficiency in Python and C/C++ for analyzing and optimizing code.
  • Excellent problem-solving and troubleshooting skills to address complex technical challenges.
  • Effective communication and collaboration skills to work with cross-functional teams.
  • Experience using multi-node GPU infrastructure.

Preferred Professional Experience:

  • Expertise in designing, analyzing, and troubleshooting large-scale distributed systems.
  • Deep understanding of GPU, CPU, or other AI accelerator architectures.
  • Experience writing and optimizing compute kernels in CUDA, Triton, or similar languages.
  • Familiarity with LLM architectures and training infrastructure.
  • Experience driving ML accuracy with low-precision formats.
  • 3+ years of relevant industry experience.
  • Experience in performance optimization of large-scale distributed systems.
  • Systematic problem-solving approach, coupled with effective verbal and written communication skills.

Apply Now:
