HPC Storage Engineer

Vacancy Overview

Full-Time

Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) is seeking a highly skilled and innovative HPC Storage Engineer with strong hands-on experience to join the Robot Learning Laboratory, a state-of-the-art core facility. This role is ideal for candidates with a solid background in storage and computer engineering and a passion for designing, deploying, and optimizing distributed storage systems for PB-scale robotics and AI research data. Emphasizing practical implementation complemented by research-driven innovation, the position focuses on architecting and implementing object storage, parallel file systems, and high-performance storage solutions to support data-intensive workloads.

Key Responsibilities

Storage Architecture Design

Design distributed storage architecture for PB-scale data growth
Architect multi-tier storage strategy (hot/warm/cold) with lifecycle policies
Design object storage systems (Ceph/MinIO/S3-compatible)
Plan parallel file system deployment (Lustre/BeeGFS/GPFS)
Design storage network topology and bandwidth requirements
Create capacity planning models and growth projections
Define storage SLAs and performance targets

Storage System Deployment

Deploy and configure Ceph or MinIO clusters for object storage
Implement Lustre or BeeGFS for high-performance parallel file systems
Configure storage protocols (S3, NFS, iSCSI, SMB)
Set up storage replication and erasure coding for data protection
Implement storage tiering and caching strategies
Integrate storage with compute clusters and data pipelines
Configure storage access controls and quotas

Performance Optimization

Tune storage systems for high throughput and low latency
Optimize I/O patterns for AI/ML workloads (training data loading, checkpointing)
Implement NVMe and SSD optimization techniques
Tune network parameters for storage traffic (jumbo frames, TCP tuning)
Optimize metadata operations and directory structures
Benchmark storage performance and identify bottlenecks
Implement caching layers (Redis, Alluxio) where appropriate

Monitoring and Troubleshooting

Implement comprehensive storage monitoring (Prometheus, Grafana, Ceph Dashboard)
Set up alerting for capacity, performance, and health issues
Troubleshoot storage performance degradation and failures
Analyze storage access patterns and usage trends
Perform root cause analysis for storage incidents
Create runbooks and troubleshooting guides
Conduct regular health checks and preventive maintenance

Data Protection and Disaster Recovery

Design and implement backup strategies for critical data
Configure snapshot and replication policies
Plan and test disaster recovery procedures
Implement data integrity checks and scrubbing
Manage storage encryption (at-rest and in-transit)
Coordinate with security team on data protection requirements
Document backup and recovery procedures

Capacity Planning and Cost Optimization

Monitor storage utilization and forecast capacity needs
Implement storage lifecycle policies (archival, deletion)
Optimize storage costs through tiering and compression
Evaluate and recommend storage hardware upgrades
Implement FinOps practices for storage cost tracking
Decommission and recycle old storage hardware
Report on storage metrics and cost efficiency

Team Leadership and Collaboration

Mentor and guide storage administrators
Collaborate with HPC Engineers on compute-storage integration
Work with data architects on storage strategy
Partner with DevOps on storage automation
Provide technical guidance to data engineers
Participate in architecture reviews and design discussions
Share knowledge through documentation and training

Academic Qualifications Required

Master's degree in Computer Science, Engineering, or a related field.
A PhD is preferred.

Professional Experience Required

Essential:

5+ years of storage systems engineering experience
3+ years with distributed storage systems (Ceph/MinIO/GlusterFS)
Distributed Storage: Expert-level experience with Ceph, MinIO, or similar object storage
Parallel File Systems: Hands-on experience with Lustre, BeeGFS, GPFS, or WekaFS
Storage Protocols: Deep understanding of S3, NFS, iSCSI, SMB, and FC
Performance Tuning: Proven ability to optimize storage for high-throughput workloads
Linux Administration: Strong Linux system administration skills (RHEL/Ubuntu)
Networking: Understanding of storage networking (10/25/100GbE, InfiniBand)
Hardware: Knowledge of storage hardware (HDDs, SSDs, NVMe, RAID)
Scripting: Proficiency in Python and Bash for automation
Monitoring: Experience with Prometheus, Grafana, or similar tools

Preferred Skills

Cloud Storage: Experience with AWS S3, Azure Blob, or GCP Cloud Storage
Data Lakes: Familiarity with Delta Lake, Iceberg, or Hudi
Kubernetes: Experience with persistent storage in Kubernetes (Rook, OpenEBS)
Emerging Technologies: Hands-on with NVMe-oF, CXL, or computational storage
Certifications: Red Hat Certified Architect, Ceph Administrator, or similar
Automation: Experience with Ansible or Terraform for storage automation
Backup Solutions: Experience with Veeam, Commvault, or Bacula
Open Source: Contributions to storage-related open-source projects
Storage Certifications: Storage-related certifications such as Red Hat Ceph Cloud Storage (EX260), Data Storage Associate, SNIA Certified Storage Networking Expert (SCSN-E), HPE ASE - Storage Solutions, or IBM Certified Solution Advisor – Spectrum Storage V7
