Application Open:
Full-Time
Job Purpose:
MBZUAI is seeking an experienced Program Manager – Data Center Operations to oversee the maintenance, issue resolution, and continuous enhancement of its high-performance computing (HPC) data center. This role will be responsible for coordinating communication between vendors, IT teams, university departments, and key stakeholders to ensure efficient operations, prompt issue resolution, and the long-term reliability of the data center infrastructure.
Key Responsibilities:
Vendor and Stakeholder Coordination:
- Serve as the primary liaison between vendors, IT teams, university departments, and facility management.
- Manage vendor contracts and service agreements, and ensure timely delivery of maintenance and repair services.
- Coordinate vendor activities for hardware repairs, infrastructure upgrades, and system maintenance.
- Work closely with university leadership and research teams to align data center operations with institutional goals.
Data Center Infrastructure and Issue Resolution:
- Ensure rapid identification and resolution of hardware failures, system malfunctions, chiller issues, UPS (Uninterruptible Power Supply) failures, power disruptions, and network outages.
- Monitor server room and data center conditions, including cooling efficiency, power distribution, and overall equipment health.
- Develop and enforce incident response protocols to quickly address data center outages and operational risks.
- Work closely with IT teams and facility managers to ensure seamless operation of HPC clusters, networking, and support systems.
Preventative Maintenance and Operations Management:
- Oversee regular maintenance schedules for data center equipment, including chillers, UPS, power distribution units (PDUs), network infrastructure, and cooling systems.
- Ensure compliance with safety, security, and environmental regulations related to data center operations.
- Track maintenance records, system logs, and performance data to proactively identify potential failures before they occur.
IT and Hardware System Support Coordination:
- Work with IT teams to ensure the availability and reliability of HPC systems, storage infrastructure, and high-performance networking.
- Coordinate system upgrades and scheduled downtime with minimal disruption to research and academic activities.
- Ensure compatibility between new and existing hardware, software, and power systems in the data center.
Risk Management, Compliance, and Documentation:
- Maintain detailed records of all data center issues, repairs, and maintenance activities.
- Ensure vendor SLAs (Service Level Agreements) and warranties are properly tracked and utilized for timely hardware replacements and repairs.
- Work with IT security teams to enforce best practices in data center security and access control.
Strategic Planning and Continuous Improvement:
- Develop long-term strategies for data center expansion, infrastructure upgrades, and efficiency improvements.
- Stay up to date on emerging data center technologies to enhance performance, reliability, and energy efficiency.
- Implement cost-saving initiatives while maintaining high operational standards.
Documentation and Communication:
- Maintain comprehensive documentation on the MBZUAI SharePoint portal and facilitate communication through e-meetings, Slack discussions, and email threads.
Other Duties:
- Perform all other duties as reasonably directed by the line manager that are commensurate with these functional objectives.
Academic Qualifications:
- Bachelor’s degree in Computer Science, Information Technology, or a related field.
- A postgraduate degree is preferred.
Professional Experience:
Essential
- 5 or more years of experience in data center operations, IT infrastructure management, or facilities management.
- Strong project management and vendor management skills.
- Technical knowledge of server hardware, cooling systems, UPS, networking, power distribution, and data center security.
- Experience working with HPC environments, cloud infrastructure, or large-scale data centers.
- Familiarity with incident response, disaster recovery, and risk management best practices.
- Strong ability to coordinate cross-functional teams, including IT, research departments, and facility teams.
- Excellent communication, negotiation, and problem-solving skills.
- Strong problem-solving, multitasking, prioritization, and communication skills, with experience working in multidisciplinary teams.
Preferred
- Certified Data Center Professional (CDCP) or Certified Data Center Manager (CDCM) certification.
- 7 or more years of experience, with at least 3 years in a managerial role.
- Experience in higher education or research institutions.
- Ability to develop and manage relationships with industry and academic partners to enhance MBZUAI’s research initiatives.
- Strong English proficiency; fluency in additional languages is a plus.