Application Open:
Full-Time
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) is seeking a highly skilled and innovative Storage Engineer with strong hands-on experience to join the Robot Learning Laboratory, a state-of-the-art core facility. This role is ideal for candidates with a strong background in storage and computer engineering and a passion for designing, deploying, and optimizing distributed storage systems for PB-scale robotics and AI research data. With an emphasis on practical implementation complemented by research-driven innovation, the position focuses on architecting and implementing object storage, parallel file systems, and high-performance storage solutions to support data-intensive workloads.
Key Responsibilities
Storage Architecture Design
- Design distributed storage architecture for PB-scale data growth
- Architect multi-tier storage strategy (hot/warm/cold) with lifecycle policies
- Design object storage systems (Ceph/MinIO/S3-compatible)
- Plan parallel file system deployment (Lustre/BeeGFS/GPFS)
- Design storage network topology and bandwidth requirements
- Create capacity planning models and growth projections
- Define storage SLAs and performance targets
Storage System Deployment
- Deploy and configure Ceph or MinIO clusters for object storage
- Implement Lustre or BeeGFS for high-performance parallel file systems
- Configure storage protocols (S3, NFS, iSCSI, SMB)
- Set up storage replication and erasure coding for data protection
- Implement storage tiering and caching strategies
- Integrate storage with compute clusters and data pipelines
- Configure storage access controls and quotas
Performance Optimization
- Tune storage systems for high throughput and low latency
- Optimize I/O patterns for AI/ML workloads (training data loading, checkpointing)
- Implement NVMe and SSD optimization techniques
- Tune network parameters for storage traffic (jumbo frames, TCP tuning)
- Optimize metadata operations and directory structures
- Benchmark storage performance and identify bottlenecks
- Implement caching layers (Redis, Alluxio) where appropriate
Monitoring and Troubleshooting
- Implement comprehensive storage monitoring (Prometheus, Grafana, Ceph Dashboard)
- Set up alerting for capacity, performance, and health issues
- Troubleshoot storage performance degradation and failures
- Analyze storage access patterns and usage trends
- Perform root cause analysis for storage incidents
- Create runbooks and troubleshooting guides
- Conduct regular health checks and preventive maintenance
Data Protection and Disaster Recovery
- Design and implement backup strategies for critical data
- Configure snapshot and replication policies
- Plan and test disaster recovery procedures
- Implement data integrity checks and scrubbing
- Manage storage encryption (at-rest and in-transit)
- Coordinate with security team on data protection requirements
- Document backup and recovery procedures
Capacity Planning and Cost Optimization
- Monitor storage utilization and forecast capacity needs
- Implement storage lifecycle policies (archival, deletion)
- Optimize storage costs through tiering and compression
- Evaluate and recommend storage hardware upgrades
- Implement FinOps practices for storage cost tracking
- Decommission and recycle old storage hardware
- Report on storage metrics and cost efficiency
Team Leadership and Collaboration
- Mentor and guide storage administrators
- Collaborate with HPC Engineers on compute-storage integration
- Work with data architects on storage strategy
- Partner with DevOps on storage automation
- Provide technical guidance to data engineers
- Participate in architecture reviews and design discussions
- Share knowledge through documentation and training
Academic Qualifications Required
- Master's degree in Computer Science, Engineering, or a related field.
- A PhD is preferred.
Professional Experience Required
Essential:
- 5+ years of storage systems engineering experience
- 3+ years with distributed storage systems (Ceph/MinIO/GlusterFS)
- Distributed Storage: Expert-level experience with Ceph, MinIO, or similar object storage
- Parallel File Systems: Hands-on experience with Lustre, BeeGFS, GPFS, or WekaFS
- Storage Protocols: Deep understanding of S3, NFS, iSCSI, SMB, and FC
- Performance Tuning: Proven ability to optimize storage for high-throughput workloads
- Linux Administration: Strong Linux system administration skills (RHEL/Ubuntu)
- Networking: Understanding of storage networking (10/25/100GbE, InfiniBand)
- Hardware: Knowledge of storage hardware (HDDs, SSDs, NVMe, RAID)
- Scripting: Proficiency in Python and Bash for automation
- Monitoring: Experience with Prometheus, Grafana, or similar tools
Preferred Skills
- Cloud Storage: Experience with AWS S3, Azure Blob Storage, or Google Cloud Storage
- Data Lakes: Familiarity with Delta Lake, Iceberg, or Hudi
- Kubernetes: Experience with persistent storage in Kubernetes (Rook, OpenEBS)
- Emerging Technologies: Hands-on with NVMe-oF, CXL, or computational storage
- Certifications: Red Hat Certified Architect, Ceph Administrator, or similar
- Automation: Experience with Ansible or Terraform for storage automation
- Backup Solutions: Experience with Veeam, Commvault, or Bacula
- Open Source: Contributions to storage-related open-source projects
- Storage-related certifications such as Red Hat Ceph Cloud Storage (EX260), Data Storage Associate, SNIA Certified Storage Networking Expert (SCSN-E), HPE ASE - Storage Solutions, or IBM Certified Solution Advisor – Spectrum Storage V7