Application Open:
Full-Time
Job Purpose:
The Distributed ML Engineer will work at the forefront of performance optimization for machine learning software stacks, especially training and inference, and will support the team in developing new, cutting-edge systems. The ideal candidate will have a strong background in parallel computing, hands-on experience in system-level coding and debugging methodologies, and experience with large-scale machine learning.
Key Responsibilities:
- Understand, analyze, profile, and optimize deep learning workloads on state-of-the-art hardware and software platforms, and provide guidance to the team on improving their efficiency at different levels of optimization.
- Design and implement performance benchmarks and testing methodologies to evaluate application performance.
- Build tools to automate workload analysis, workload optimization, and other critical workflows.
- Triage system issues and identify bottlenecks and inefficiencies by analyzing their sources and their impact on hardware and network, and propose solutions to enhance GPU utilization.
- Support the team to develop appropriate kernels and systems for new model architectures and algorithms.
- Participate in or lead design reviews with peers and stakeholders to decide among available technologies.
- Review code developed by other developers and provide feedback to ensure best practices (e.g., style guidelines, checking code in, accuracy, testability, and efficiency).
- Contribute to existing documentation or educational content and adapt content based on product/program updates and user feedback.
- Represent MBZUAI at industry conferences and events, showcasing the institution’s cutting-edge HPC and deep learning capabilities and establishing MBZUAI as a global leader in AI research and innovation.
- Perform all other duties as reasonably directed by the line manager that are commensurate with these functional objectives.
Academic Qualifications:
- Ph.D. in CS, EE, or CSEE with 1+ years of working experience, or a Master's in CS, EE, or CSEE (or equivalent) with 2+ years of working experience.
Minimum Professional Experience:
- Background in deep learning model architectures and experience with PyTorch and large-scale distributed training.
- Proficiency in Python and C/C++ for analyzing and optimizing code.
- Excellent problem-solving and troubleshooting skills to address complex technical challenges.
- Effective communication and collaboration skills to work with cross-functional teams.
- Experience using multi-node GPU infrastructure.
Preferred Professional Experience:
- Expertise in designing, analyzing, and troubleshooting large-scale distributed systems.
- Deep understanding of GPU, CPU, or other AI accelerator architectures.
- Experience writing and optimizing compute kernels in CUDA, Triton, or similar languages.
- Familiarity with LLM architectures and training infrastructure.
- Experience driving ML accuracy with low-precision formats.
- 3+ years of relevant industry experience.
- Experience in performance optimization of large-scale distributed systems.
- Systematic problem-solving approach, coupled with effective verbal and written communication skills.