Application Open:
Full-Time
Job Purpose:
The Data Engineer will design and implement scalable, reliable, and high-performance distributed systems to support large-scale data products and business needs, including machine learning (ML), natural language processing (NLP), computer vision (CV), reporting, and growth analysis. Working with cross-functional teams, the role will troubleshoot production systems, resolve code-related issues, and review user documentation. The Data Engineer will establish engineering best practices, champion code quality standards, and lead the architecture of data hardware and software solutions. This role offers the opportunity to advance the MBZUAI development stack, drive innovation, and ensure scalability as the institution grows.
Key Responsibilities:
Design and Development of Data Systems:
- Design and implement industry-leading distributed systems that are flexible, reliable, scalable, and extensible.
- Build high-performance storage and computing systems to support massive core datasets and large-scale products.
- Develop big data systems for various purposes, including recommendation engines, machine learning (ML), natural language processing (NLP), computer vision (CV), reporting, growth analysis, and multi-dimensional analysis.
Troubleshooting and Maintenance:
- Troubleshoot production systems, identify application code-related issues, and ensure timely resolution.
- Monitor system performance, diagnose bottlenecks, and implement optimizations to enhance efficiency.
- Review and provide feedback on final user documentation to ensure accuracy and clarity.
Engineering Best Practices:
- Establish and promote solid design principles and best engineering practices for both technical and non-technical stakeholders.
- Provide input on, follow, and evangelize code quality guidelines and standards.
- Conduct code reviews to ensure adherence to best practices, maintainability, and scalability.
Data Architecture and Infrastructure:
- Take charge of the architecture of data hardware (HW) and software (SW) solutions, ensuring they meet business and technical requirements.
- Design and implement data pipelines, ETL processes, and data integration workflows.
- Optimize data storage, retrieval, and processing for performance and cost-efficiency.
Collaboration and Communication:
- Work closely with data scientists, analysts, and business stakeholders to understand data requirements and deliver actionable insights.
- Collaborate with cross-functional teams to integrate data solutions into broader systems and workflows.
- Communicate complex technical concepts effectively to non-technical stakeholders.
Innovation and Continuous Improvement:
- Stay updated with emerging technologies, tools, and trends in data engineering and big data.
- Propose and implement innovative solutions to improve data processing, storage, and analysis capabilities.
- Continuously optimize data systems to handle increasing volumes of data and evolving business needs.
Data Security and Compliance:
- Implement data security best practices to protect sensitive information and ensure compliance with regulations (e.g., GDPR, CCPA).
- Conduct regular audits and vulnerability assessments to maintain data integrity and security.
- Ensure data systems adhere to industry standards and organizational policies.
Testing and Quality Assurance:
- Develop and implement testing frameworks for data pipelines and systems to ensure reliability and accuracy.
- Collaborate with QA teams to identify and resolve data-related issues.
- Ensure data quality through validation, cleansing, and transformation processes.
DevOps and Deployment:
- Collaborate with DevOps teams to ensure smooth deployment and integration of data systems.
- Implement logging, monitoring, and alerting systems to ensure data system health and performance.
- Manage CI/CD pipelines for data engineering workflows.
Mentorship and Leadership:
- Mentor junior data engineers and team members to foster skill development and growth.
- Lead technical discussions and contribute to strategic decision-making for data engineering initiatives.
- Drive the adoption of best practices and innovative technologies within the team.
Other Duties:
- Perform all other duties as reasonably directed by the line manager that are commensurate with these functional objectives.
Academic Qualifications:
- Bachelor’s degree in Computer Science or a related field.
- A postgraduate degree is preferred.
Professional Experience:
Essential
- Minimum 5 years of experience in one or more programming languages such as Java, C++, or Python.
- Minimum 5 years of proven experience as a Data Engineer developing complex, high-quality data, software, ML, NLP, CV, and AI application systems.
- Experience in petabyte-level data processing is a plus.
- Strong understanding of data platform concepts, including Data Lake, Data Warehouse, ETL, Big Data Processing, Real-time Processing, Scheduling, Monitoring, Data Governance, and Task Governance.
- Proficiency in Big Data technologies such as Hadoop, MapReduce, Hive, Spark, Metastore, Flume, Kafka, Flink, and Elasticsearch.
- Experience architecting data systems for complex business problems, including data warehousing, data ingestion, query optimization, data segregation, ETL, and ELT, using cloud platforms and services such as AWS (Redshift, EC2, S3) and Azure.
- Expertise in optimizing columnar and distributed data processing systems and infrastructure.
- Proficient in applying best-practice Design Patterns and Design Principles to software architecture and algorithms.
- Experience building enterprise software architectures such as Microservices, SOA, and MVC.
- Hands-on experience with monitoring, alerting, and logging tools like Prometheus, New Relic, Datadog, ELK stack, and distributed tracing.
- Strong knowledge of testing methodologies, including unit tests, component tests, and integration tests.
- Expertise in database technologies, including MySQL, PostgreSQL (knowledge of normal forms, ACID, isolation levels, index anatomy), and NoSQL databases like MongoDB and Redis.
- Proficiency in managing Linux environments.
- Proficient understanding of code versioning tools such as Git/GitFlow and SourceTree.
- Experience with build process management and continuous integration.
- Experience working with modern software development methodologies such as Scrum, Kanban, and XP.
Preferred
- 8+ years of industry experience.
- Experience in higher education or research institutions, with an understanding of core research facility operations.
- Proficiency in data analytics for process optimization and continuous improvement.
- Strong English proficiency, with fluency in additional languages as a plus.