Application Open:
Full-Time
Job Purpose:
The purpose of this role is to design, implement, and operationalize scalable cloud infrastructure on AWS that underpins MBZUAI’s AI research, machine learning platforms, and enterprise applications. As part of the Cloud HPC team, the Cloud Implementation Engineer will play a pivotal role in enabling high-performance, secure, and cost-efficient cloud environments for real-time AI workloads and data-driven innovation. This role bridges the gap between research and production, ensuring that complex AI systems are reliably deployed and maintained at scale in the cloud. Through automation, cloud-native practices, and close collaboration with internal teams, this position supports MBZUAI’s mission to advance AI through world-class infrastructure.
Key Responsibilities:
Strategic design and implementation of AWS infrastructure to support MBZUAI’s AI research, HPC environments, and production-grade systems.
- Architect Scalable Cloud Infrastructure: Design and implement secure, scalable AWS infrastructure tailored for high-performance computing (HPC) and AI workloads, ensuring performance and reliability.
- Core AWS Services Deployment: Provision and manage essential AWS services including EC2, EBS, S3, VPC, IAM, AWS Batch, Lambda, SageMaker, and FSx for Lustre.
- Containerization & Orchestration: Implement and manage containerization technologies such as Docker and Kubernetes, applying an understanding of microservices architecture and orchestration.
- AI/ML Optimization: Optimize cloud architectures for low-latency inference, GPU-accelerated computation, and high-throughput data transfer for machine learning pipelines and real-time applications.
- Infrastructure as Code (IaC): Design and maintain repeatable infrastructure using Terraform, AWS CDK, or CloudFormation, aligned with CI/CD practices.
- Security Architecture: Define and implement robust security measures, including IAM policies, encryption, network segmentation, GuardDuty, and security auditing.
- Robotics & Embodied AI: Implement and maintain cloud infrastructure for robotics and Embodied AI systems on AWS.
- Cost Optimization & Governance: Plan and implement resource tagging, budget controls, and cost reports to ensure efficiency and visibility across environments.
- Stakeholder Collaboration: Partner with research, data science, and software engineering teams to translate project requirements into infrastructure blueprints.
Execution, monitoring, and support of AWS infrastructure for continuous availability and performance.
- Environment Provisioning & Lifecycle Management: Deploy and manage cloud environments across development, staging, and production, supporting both research and enterprise workloads.
- Monitoring & Observability: Set up monitoring, alerting, and telemetry tools (e.g., CloudWatch, CloudTrail, Prometheus, Grafana) to ensure performance, reliability, and fault detection.
- CI/CD Enablement: Integrate infrastructure with CI/CD pipelines using AWS CodePipeline, GitHub Actions, or Jenkins to automate infrastructure and application deployments.
- Automation & Scripting: Automate operational tasks including backups, scaling, provisioning, and compliance checks using Python, Bash, or PowerShell.
- Troubleshooting & Support: Act as the point of escalation for cloud-related issues; resolve performance, security, and deployment incidents effectively.
- Collaboration: Work closely with development, data, and operations teams to gather requirements, implement solutions, troubleshoot issues, and support full lifecycle deployment of AI and analytics workloads.
- Documentation & Knowledge Sharing: Create and maintain comprehensive, implementation-focused documentation for system architecture, deployment processes, workflows, and configuration standards to support operational transparency and onboarding.
Academic Qualifications:
Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
Professional Experience:
Essential
- 5+ years of hands-on experience with AWS cloud infrastructure implementation, including 2+ years supporting AI/ML or HPC workloads.
- Expertise in AWS services such as EC2, EFS/FSx, S3, Lambda, IAM, VPC, CloudWatch, and CloudFormation or Terraform.
- Strong Linux administration skills and knowledge of networking (DNS, VPN, firewalls, routing).
- Proficiency in scripting and automation (Python, Bash, or similar).
- Experience with container platforms (Docker, Kubernetes, or AWS ECS/EKS).
- Familiarity with building secure, compliant, multi-account AWS environments.
Preferred
- AWS Certification(s):
  - AWS Certified Solutions Architect – Professional
  - AWS Certified DevOps Engineer – Professional
  - AWS Certified Machine Learning – Specialty
- Hands-on experience with:
  - High-performance storage (Lustre, FSx)
  - Real-time inference pipelines
  - Event-driven/serverless architecture
- Experience working in a research or academic setting supporting AI or data science workloads.
- Familiarity with cost governance tools (e.g., AWS Budgets, Cost Explorer, Trusted Advisor).
- Practical experience with AWS services (e.g., AWS RoboMaker, IoT Core, EC2, SageMaker) for the deployment, management, and support of robotics and other Embodied AI systems.
- Ability to collaborate with cross-functional teams and communicate effectively with technical and non-technical stakeholders.
- Proactive mindset with excellent troubleshooting and problem-solving skills.
- Comfortable working in fast-paced, dynamic environments with evolving requirements.