Cloud Support Engineer – AWS

Vacancy Overview

Full-Time

Job Purpose:

The Cloud Support Engineer – AWS plays a critical role in the day-to-day operations, monitoring, and support of MBZUAI’s AWS-based infrastructure and services. This role ensures the reliability, performance, and security of cloud platforms supporting AI research, HPC workloads, and enterprise systems. The engineer will work closely with DevOps, research, and infrastructure teams to diagnose and resolve issues, automate routine tasks, and maintain cloud service health, contributing to MBZUAI’s mission to advance AI innovation through stable, high-performing cloud operations.

Key Responsibilities:

Strategic and technical activities that enhance reliability, performance, and supportability of AWS environments.

  • Cloud Service Support: Provide technical support for AWS services including SageMaker, EC2, EBS, S3, VPC, IAM, Lambda, Batch, and FSx, ensuring availability and performance.
  • Issue Diagnosis & Resolution: Analyse, troubleshoot, and resolve infrastructure and service-level issues across staging, test, and production environments.
  • Performance Optimization: Identify performance bottlenecks and propose optimizations for compute, storage, and network layers in AI/HPC workloads.
  • Incident & Problem Management: Lead root cause analysis (RCA) and support incident response to ensure minimal impact on research and production systems.
  • Knowledge Base Development: Create and maintain internal documentation, FAQs, and technical guides to reduce time to resolution and enable knowledge sharing.
  • Collaboration & Escalation: Interface with AWS support and internal engineering teams to escalate and resolve complex service issues.

Execution and automation of support workflows, monitoring, and maintenance activities.

  • Monitoring & Alerting: Configure and maintain dashboards, alerts, and logs using CloudWatch, CloudTrail, and third-party tools like Prometheus, Datadog, or Grafana.
  • System Health Checks: Perform regular audits and health checks on compute nodes, networking components, storage systems, and IAM policies.
  • Automation & Scripting: Automate routine support and operational tasks using scripts (e.g., Python, Bash, PowerShell).
  • User Support & Access Control: Manage user access, permissions, and provisioning workflows in a secure, auditable manner.
  • Backup & Recovery: Monitor backup jobs, conduct restore testing, and ensure compliance with data retention policies.
  • Compliance Support: Assist with enforcement of security controls and audit requirements (e.g., tagging policies, resource limits, governance frameworks).

Academic Qualifications:

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.

Professional Experience:

Essential

  • 3+ years of experience in AWS cloud support, systems administration, or cloud operations roles.
  • Prior experience supporting cloud-native environments, preferably in research, academic, or HPC contexts.
  • Expertise in AWS services such as SageMaker, EC2, EFS/FSx, S3, Lambda, IAM, VPC, CloudWatch, and CloudFormation or Terraform.
  • Strong Linux administration skills and knowledge of networking (DNS, VPN, firewalls, routing).
  • Proficiency in scripting and automation (Python, Bash, or similar).
  • Experience with container platforms (Docker, Kubernetes, or AWS ECS/EKS).
  • Familiarity with building secure, compliant, multi-account AWS environments.

Preferred

  • AWS Certification(s) such as:
    1. AWS Certified SysOps Administrator – Associate
    2. AWS Certified Solutions Architect – Associate
    3. AWS Certified DevOps Engineer – Professional
  • Experience in supporting AI/ML or HPC workloads in a production environment.
  • Familiarity with IaC tools (Terraform, CloudFormation), container platforms (ECS, EKS), and CI/CD tools.
  • Experience with multi-account AWS Organizations and cost governance practices.
  • Experience with AWS services (e.g., AWS RoboMaker, IoT Core, EC2, SageMaker) for the management and support of robotics and other embodied AI systems.
  • Exposure to compliance/audit frameworks (e.g., CIS, ISO, GDPR) is an advantage.
