Application Open:
Full-Time
Job Purpose:
The DevOps/MLOps Engineer is expected to deliver high-quality results in a fast-paced environment. The role supports AI engineering and research teams by automating and optimizing systems, with a focus on the accuracy, interpretability, and performance of machine learning solutions. Responsibilities include developing automated systems for OS and configuration rebuilds, managing ISO images, and automating deployment and monitoring across workstations, mobile devices, and robotics. The role will also configure network settings, automate server management with Ansible and Terraform, and maintain GPU and ML development environments (Python, Conda, CUDA, PyTorch, etc.). In addition, the role will manage CI/CD pipelines, troubleshoot hardware issues, set up Kubernetes (MicroK8s, K3s, AWS EKS, Azure AKS, Google GKE), and support ML pipelines using Kubeflow, Argo, Ray, and KubeRay. This role offers an exciting opportunity to drive innovation at MBZUAI at the intersection of DevOps and MLOps.
Key Responsibilities:
Infrastructure Automation and Management:
- Develop and maintain automated systems for OS and configuration rebuilds using Infrastructure as Code (IaC) principles.
- Create and manage ISO images for customized operating systems and software packages.
- Automate workstation and server management using tools such as Ansible and Terraform, together with cloud services (AWS, Azure, GCP).
Application Deployment and Monitoring:
- Build and maintain automated systems for application deployment and monitoring across workstations, mobile devices, robotics, and other platforms.
- Configure and manage network settings, including DHCP, DNS, and TLS certificates.
GPU and ML Environment Support:
- Set up and support GPU environments, including GPU operators, drivers, and container toolkits.
- Maintain ML development environments with Python, Conda, CUDA, PyTorch, cuDNN, NCCL, GCC, and libraries such as transformers and scikit-learn, resolving dependency conflicts and preserving environment integrity.
CI/CD Pipeline Management:
- Manage and optimize CI/CD pipelines using GitHub Actions or similar tools.
Hardware and System Troubleshooting:
- Troubleshoot and resolve hardware issues, including CPU, memory, and disk-related problems.
Kubernetes and Cloud Orchestration:
- Deploy and manage Kubernetes clusters using MicroK8s, K3s, and cloud-managed services such as AWS EKS, Azure AKS, and Google GKE.
ML Pipeline Development and Support:
- Develop and support automated ML pipelines using tools such as Kubeflow, Argo, Ray, and KubeRay.
Performance Monitoring and Optimization:
- Monitor system performance, implement logging and alerting mechanisms, and ensure high availability and scalability.
Collaboration and Security:
- Collaborate with AI engineers and researchers to optimize ML workflows and infrastructure.
- Implement security best practices for infrastructure, applications, and data pipelines.
Innovation and Continuous Improvement:
- Stay updated with emerging technologies and tools in DevOps, MLOps, and cloud computing.
- Optimize resource utilization and cost efficiency for cloud and on-premises infrastructure.
Documentation and Knowledge Sharing:
- Document processes, architectures, and configurations to ensure knowledge sharing and maintainability.
- Mentor junior team members and promote best practices in DevOps and MLOps.
Disaster Recovery and Backup:
- Design and implement disaster recovery and backup strategies for critical systems.
Production Integration:
- Support the integration of AI/ML models into production environments with a focus on scalability and reliability.
Other Duties:
- Perform all other duties as reasonably directed by the line manager that are commensurate with these functional objectives.
Academic Qualifications:
- Master’s degree in Computer Science, Information Systems, or a related field, with a specialization in Machine Learning, Deep Learning, Natural Language Processing, Computer Vision, or Software Engineering.
- An additional postgraduate degree is preferred.
Professional Experience:
Essential
- 5+ years of hands-on engineering experience maintaining AI platforms in cloud/cluster environments, working with common compute and AI accelerator architectures such as CPUs, GPUs, and NPUs.
- 5+ years of solid experience in MLOps/DevOps, including managing model lifecycles, setting up environments, handling dependencies, model serving (real-time/batch), cost-aware resource auto-scaling in cloud environments, distributed training, and container technologies.
- Proficient in Docker, Docker Swarm, Docker Compose, Docker Registry, and AWS EKS.
- Skilled in Ubuntu Linux installation, configuration, and management.
- Familiar with hybrid hardware capabilities, including CPU and GPU configurations for workstations, mobile devices, and robotics.
- Experienced in network booting using DHCP, PXE, iPXE, and TFTP servers.
- Knowledgeable in network routing with tools such as NGINX, Apache APISIX, and ingress controllers.
- Proficient in TLS certificate generation using cert-manager and Let’s Encrypt.
- Hands-on experience with provisioning tools such as Ansible and Terraform.
- Skilled in logging and monitoring tools like Grafana, Logstash, CloudWatch, and Prometheus.
- 5+ years of Kubernetes administration and development experience.
- Excellent English communication skills, with a collaborative attitude and the ability to work effectively with engineers at all levels.
- Strong sense of responsibility, with the ability to respond to emergent operational issues so that team development and deployment run smoothly. Experience in preventive maintenance, quality standards, process optimization, and regulatory compliance.
- Skilled in technical documentation and risk assessments, ensuring operational continuity with minimal disruption.
Preferred
- In-depth knowledge of computer architecture, high-performance programming, and parallel programming.
- Experience with network storage solutions such as S3, MinIO, Ceph, and Longhorn.
- Familiarity with SLURM and OpenMPI.
- Familiarity with AI frameworks such as Megatron and DeepSpeed.
- Experience in higher education or research institutions, with an understanding of core research facility operations.
- Proficiency in data analytics for process optimization and continuous improvement.
- Strong English proficiency; fluency in additional languages is a plus.