Senior MLOps Engineer


Vacancy Overview

Application Open:

Full-Time

MBZUAI is looking to recruit a Senior MLOps Engineer for its Institute of Foundation Models (IFM). IFM is dedicated to pioneering academic research at the forefront of global AI innovation, driven by real-world societal needs. It builds some of the world’s most powerful foundation models: open, fast, and focused on solving real-world problems. With deep scientific roots and world-class talent in Abu Dhabi, Paris, and Silicon Valley, IFM is shaping the future of AI.

The Senior MLOps Engineer will design, build, and maintain robust machine learning (ML) infrastructure across training, inference, and deployment pipelines. The role takes ownership of the model lifecycle, from data ingestion to real-time serving, and ensures that large language models (LLMs) and speech models are deployed efficiently, securely, and reproducibly in Kubernetes-based environments.

This position requires hands-on experience with Kubernetes (EKS), Helm, AWS cloud infrastructure, and modern MLOps toolchains (e.g., vLLM, SGLang, OpenWebUI, Weights & Biases, MLflow). Familiarity with speech/voice AI frameworks and platforms such as ElevenLabs, Whisper, and RVC is also valuable.

Key Responsibilities

Infrastructure Design and Cloud Management 

  • Design, build, and maintain scalable ML infrastructure on AWS (EKS, EC2, RDS, S3, IAM), Azure, or GCP to support AI and data-intensive workloads. 
  • Deploy and manage Kubernetes clusters using Helm, ArgoCD, and Terraform for reproducible and secure environments. 
  • Ensure observability, cost optimization, and reliability of multi-environment cloud resources with integrated monitoring (Prometheus, Grafana); see the sketch below.
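
For illustration only, a minimal Python sketch of the kind of observability and cost-hygiene check this list describes, using boto3 to flag EKS clusters that are missing a cost-allocation tag. The region name and the "team" tag key are assumptions, not details from this vacancy.

    # Minimal sketch: flag EKS clusters missing a cost-allocation tag.
    # Assumes AWS credentials are configured; "team" is a hypothetical tag key.
    import boto3

    eks = boto3.client("eks", region_name="us-east-1")  # region is illustrative

    for name in eks.list_clusters()["clusters"]:  # first page suffices for a sketch
        cluster = eks.describe_cluster(name=name)["cluster"]
        tags = cluster.get("tags", {})
        if "team" not in tags:
            print(f"{name}: missing 'team' cost-allocation tag")
        else:
            print(f"{name}: {cluster['status']} (team={tags['team']})")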

MLOps and Pipeline Automation

  • Develop and maintain automated MLOps pipelines for data versioning, model validation, and deployment using GitHub Actions, Jenkins, or AWS CodePipeline. 
  • Implement and optimize high-throughput model serving pipelines using vLLM, TensorRT, SGLang, or similar frameworks; see the sketch after this list. 
  • Manage CI/CD workflows for model and application releases, integrating continuous testing and rollback strategies. 
  • Support real-time multimodal inference workloads (voice, text, vision) across distributed clusters. 
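
For illustration, a minimal sketch of batched text generation with vLLM's offline API, the kind of serving stack named above; the model name is an assumption, not a model this role necessarily operates.

    # Minimal sketch: batched LLM inference with vLLM's offline API.
    from vllm import LLM, SamplingParams

    # Hypothetical model choice, for illustration only.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
    params = SamplingParams(temperature=0.7, max_tokens=128)

    # vLLM batches and schedules prompts internally for high throughput.
    outputs = llm.generate(["Summarize what an MLOps engineer does."], params)
    for out in outputs:
        print(out.outputs[0].text)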

Infrastructure as Code and System Automation

  • Implement Infrastructure as Code (IaC) using Terraform, Helm, and Ansible for automated configuration, provisioning, and governance; see the sketch after this list. 
  • Create and manage ISO images, operating system builds, and environment rebuilds for consistency across environments. 
  • Automate workstation, server, and network configurations (DHCP, DNS, TLS) across on-premises and cloud systems. 
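
For illustration, a minimal sketch of driving Terraform and Helm from an automation script, as in the IaC work above; the release name, chart path, and namespace are hypothetical.

    # Minimal sketch: provision with Terraform, deploy with Helm.
    import subprocess

    def sh(*cmd: str) -> None:
        """Run a command and fail loudly, as a CI step would."""
        subprocess.run(cmd, check=True)

    # Declaratively provision cloud resources.
    sh("terraform", "init")
    sh("terraform", "apply", "-auto-approve")

    # Install or upgrade a release; --atomic rolls back automatically on failure.
    sh("helm", "upgrade", "--install", "llm-serving", "./charts/llm-serving",
       "--namespace", "ml", "--create-namespace", "--atomic")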

GPU and ML Environment Support

  • Set up and maintain GPU-accelerated environments with CUDA, cuDNN, PyTorch, NCCL, and relevant AI/ML libraries; see the sketch after this list. 
  • Support containerized GPU workloads using Kubernetes GPU operators and optimize performance for LLM and TTS inference. 
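
For illustration, a minimal preflight check of a GPU node's CUDA/cuDNN/NCCL stack, assuming PyTorch is installed with CUDA support.

    # Minimal sketch: verify the GPU software stack from Python.
    import torch

    assert torch.cuda.is_available(), "CUDA not visible to PyTorch"
    print("device:", torch.cuda.get_device_name(0))
    print("cuda:", torch.version.cuda)
    print("cudnn:", torch.backends.cudnn.version())
    print("nccl:", torch.cuda.nccl.version())  # matters for multi-GPU workloads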

Application Deployment and Monitoring

  • Deploy and manage production-ready AI/ML applications with OpenWebUI, Gradio, or similar front-end interfaces for internal and external demos; see the sketch after this list. 
  • Monitor and troubleshoot performance, resource utilization, and reliability; ensure proactive alerting and fault resolution. 
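
For illustration, a minimal Gradio front end of the kind used for such demos; the answer() function here is a hypothetical stand-in for a call to a real serving endpoint.

    # Minimal sketch: a text-in, text-out demo UI with Gradio.
    import gradio as gr

    def answer(prompt: str) -> str:
        # Placeholder: in practice this would call a vLLM/SGLang endpoint.
        return f"(model output for: {prompt})"

    demo = gr.Interface(fn=answer, inputs="text", outputs="text",
                        title="Internal LLM demo")
    demo.launch(server_name="0.0.0.0", server_port=7860)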

Security, Compliance, and Reliability

  • Implement and enforce security best practices across infrastructure, data pipelines, and applications. 
  • Design and maintain disaster recovery, backup, and data protection strategies for critical systems; see the sketch after this list. 
  • Ensure compliance with institutional and regulatory standards for data integrity and system resilience. 
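
For illustration, one concrete data-protection control of the kind listed above: enabling S3 object versioning so artifacts can be recovered after accidental deletion or overwrite. The bucket name is hypothetical.

    # Minimal sketch: turn on versioning for a backup/artifact bucket.
    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_versioning(
        Bucket="ifm-model-artifacts",  # hypothetical bucket name
        VersioningConfiguration={"Status": "Enabled"},
    )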

Collaboration and Integration

  • Collaborate closely with ML researchers, AI engineers, and data scientists to productize and scale AI models (LLMs, ASR, TTS). 
  • Coordinate with cross-functional teams for project deployment, performance benchmarking, and workflow optimization. 

Innovation and Continuous Improvement

  • Evaluate and integrate emerging DevOps, MLOps, and cloud-native technologies to enhance automation and scalability. 
  • Optimize cloud and hardware resource utilization to achieve operational efficiency and cost reduction. 

Documentation and Knowledge Transfer

  • Maintain comprehensive documentation of infrastructure architectures, deployment processes, and operational workflows. 
  • Mentor junior engineers and promote best practices in DevOps, MLOps, and secure infrastructure management. 

Academic Qualification

  • Bachelor’s degree in Computer Science, AI Systems Engineering, or a related field. 

Professional Experience Required

Essential:

  • Minimum of 4 years of experience in MLOps, DevOps, or Cloud Infrastructure Engineering for ML systems. 
  • Strong proficiency in Kubernetes, Helm, and container orchestration. 
  • Experience deploying ML models via vLLM, SGLang, TensorRT, or Ray Serve. 
  • Proficiency with AWS services (EKS, EC2, S3, RDS, CloudWatch, IAM). 
  • Solid experience with Python, Docker, Git, and CI/CD pipelines. 
  • Strong understanding of model lifecycle management, data pipelines, and observability tools (Grafana, Prometheus, Loki). 
  • Excellent collaboration skills with ML researchers and software engineers. 

Preferred Experience

  • Extensive experience with vLLM, Kubernetes (K8s), ElevenLabs, Whisper, Gradio/OpenWebUI, or custom TTS/ASR model hosting. 
  • Familiarity with multi-GPU scheduling, NCCL optimization, and HPC cluster integration. 
  • Knowledge of security, cost management, and network policy in multi-tenant Kubernetes clusters and Cloudflare-based systems. 
  • Prior work in LLM deployment, fine-tuning pipelines, or foundation model research. 
  • Exposure to data governance and responsible AI operations in research or enterprise settings. 

Apply Now:
