Storage Engineer


Vacancy Overview

Application Open:

Full-Time

Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) is seeking a highly skilled and innovative Storage Engineer with strong hands-on experience to join the Robot Learning Laboratory, a state-of-the-art core facility. The role suits candidates with a strong background in storage and computer engineering and a passion for designing, deploying, and optimizing distributed storage systems for petabyte-scale (PB-scale) robotics and AI research data. Emphasizing practical implementation complemented by research-driven innovation, the position focuses on architecting and implementing object storage, parallel file systems, and high-performance storage solutions to support data-intensive workloads.

Key Responsibilities

Storage Architecture Design

  • Design distributed storage architecture for PB-scale data growth
  • Architect multi-tier storage strategy (hot/warm/cold) with lifecycle policies
  • Design object storage systems (Ceph/MinIO/S3-compatible)
  • Plan parallel file system deployment (Lustre/BeeGFS/GPFS)
  • Design storage network topology and bandwidth requirements
  • Create capacity planning models and growth projections
  • Define storage SLAs and performance targets

Storage System Deployment

  • Deploy and configure Ceph or MinIO clusters for object storage
  • Implement Lustre or BeeGFS for high-performance parallel file systems
  • Configure storage protocols (S3, NFS, iSCSI, SMB)
  • Set up storage replication and erasure coding for data protection
  • Implement storage tiering and caching strategies
  • Integrate storage with compute clusters and data pipelines
  • Configure storage access controls and quotas

Performance Optimization

  • Tune storage systems for high throughput and low latency
  • Optimize I/O patterns for AI/ML workloads (training data loading, checkpointing)
  • Implement NVMe and SSD optimization techniques
  • Tune network parameters for storage traffic (jumbo frames, TCP tuning)
  • Optimize metadata operations and directory structures
  • Benchmark storage performance and identify bottlenecks
  • Implement caching layers (Redis, Alluxio) where appropriate

Monitoring and Troubleshooting

  • Implement comprehensive storage monitoring (Prometheus, Grafana, Ceph Dashboard)
  • Set up alerting for capacity, performance, and health issues
  • Troubleshoot storage performance degradation and failures
  • Analyze storage access patterns and usage trends
  • Perform root cause analysis for storage incidents
  • Create runbooks and troubleshooting guides
  • Conduct regular health checks and preventive maintenance

Data Protection and Disaster Recovery

  • Design and implement backup strategies for critical data
  • Configure snapshot and replication policies
  • Plan and test disaster recovery procedures
  • Implement data integrity checks and scrubbing
  • Manage storage encryption (at-rest and in-transit)
  • Coordinate with security team on data protection requirements
  • Document backup and recovery procedures

Capacity Planning and Cost Optimization

  • Monitor storage utilization and forecast capacity needs
  • Implement storage lifecycle policies (archival, deletion)
  • Optimize storage costs through tiering and compression
  • Evaluate and recommend storage hardware upgrades
  • Implement FinOps practices for storage cost tracking
  • Decommission and recycle old storage hardware
  • Report on storage metrics and cost efficiency

Team Leadership and Collaboration

  • Mentor and guide storage administrators
  • Collaborate with HPC Engineers on compute-storage integration
  • Work with data architects on storage strategy
  • Partner with DevOps on storage automation
  • Provide technical guidance to data engineers
  • Participate in architecture reviews and design discussions
  • Share knowledge through documentation and training

Academic Qualifications Required

  • Master's degree in Computer Science, Engineering, or a related field.
  • A PhD is preferred.

Professional Experience Required
Essential:

  • 5+ years of storage systems engineering experience
  • 3+ years with distributed storage systems (Ceph/MinIO/GlusterFS)
  • Distributed Storage: Expert-level experience with Ceph, MinIO, or similar object storage
  • Parallel File Systems: Hands-on experience with Lustre, BeeGFS, GPFS, or WekaFS
  • Storage Protocols: Deep understanding of S3, NFS, iSCSI, SMB, and FC
  • Performance Tuning: Proven ability to optimize storage for high-throughput workloads
  • Linux Administration: Strong Linux system administration skills (RHEL/Ubuntu)
  • Networking: Understanding of storage networking (10/25/100GbE, InfiniBand)
  • Hardware: Knowledge of storage hardware (HDDs, SSDs, NVMe, RAID)
  • Scripting: Proficiency in Python and Bash for automation
  • Monitoring: Experience with Prometheus, Grafana, or similar tools

Preferred Skills

  • Cloud Storage: Experience with AWS S3, Azure Blob, or GCP Cloud Storage
  • Data Lakes: Familiarity with Delta Lake, Iceberg, or Hudi
  • Kubernetes: Experience with persistent storage in Kubernetes (Rook, OpenEBS)
  • Emerging Technologies: Hands-on with NVMe-oF, CXL, or computational storage
  • Certifications: Red Hat Certified Architect, Ceph Administrator, or similar
  • Automation: Experience with Ansible or Terraform for storage automation
  • Backup Solutions: Experience with Veeam, Commvault, or Bacula
  • Open Source: Contributions to storage-related open-source projects
  • Storage-related certifications such as Red Hat Ceph Cloud Storage (EX260), Data Storage Associate, SNIA Certified Storage Networking Expert (SCSN-E), HPE ASE - Storage Solutions, or IBM Certified Solution Advisor – Spectrum Storage V7
