Application Open:
Full-Time
Job Purpose:
MBZUAI requires a Data Engineer specializing in Natural Language Processing (NLP) and large-scale data processing to quickly and effectively gather, curate, and prepare high-quality datasets to support cutting-edge NLP research. The role will be instrumental in enabling researchers by delivering essential data through efficient and scalable engineering practices, including web crawling, LLM-generated content refinement, and robust data pipelines, primarily leveraging Python and related technologies.
Key Responsibilities:
- Rapidly collect, curate, and preprocess datasets based on detailed specifications provided by NLP researchers, delivering data within tight timelines (typically within 1-2 days).
- Develop and maintain efficient web crawling solutions, APIs, and automated workflows to continuously improve data collection processes.
- Refine and evaluate outputs from Large Language Models (LLMs) to generate structured datasets suitable for model training and benchmarking.
- Implement scalable data pipelines, ensuring efficient data processing, storage, retrieval, and distribution to research teams.
- Collaborate closely with researchers and engineers to ensure collected data meets specified quality and relevance criteria.
- Document data collection methodologies, dataset characteristics, and pipeline architecture clearly and effectively.
- Engage with peer teams and participate in technical reviews to uphold best practices and data quality standards.
- Represent MBZUAI at industry and research forums, showcasing technical capabilities in large-scale data processing and AI data infrastructure.
- Perform all other duties as reasonably directed by the line manager commensurate with these functional objectives.
Institutional Strategy:
- Ensure timely provision of data and reports to management and recommend strategic and operational improvements to support planning, decision-making, and continuous improvement.
- Analyze monthly and quarterly performance results, identify areas for remediation, and if required take prompt, corrective action to meet Institutional Function Models and MBZUAI goals.
- Ensure priorities and plans at every level fully align with divisional and MBZUAI strategic objectives.
Financial & Organizational Coordination:
- Contribute to accurate demand planning by providing department data and information to inform accurate forecasting of resource requirements.
- Promote cross-collaboration between Research departments and sections.
Adapt on business change and continuous improvement by identifying opportunities and new requirements including improvements in IT- and AI- enabled processes, data reporting, and analytics.
Academic Qualifications:
- Bachelor’s degree in Computer Science, Data Science, Engineering, or a related technical field required
- Master’s degree or equivalent experience in Computer Science, Data Engineering, or related technical fields preferred.
Professional Experience:
Essential
- Extensive experience in data engineering, data processing, and automation using Python.
- Demonstrated proficiency in designing and deploying web crawling solutions, automated data extraction, and processing pipelines.
- Strong understanding of data structures, algorithms, databases, SQL, and performance optimization.
- Experience working with cloud infrastructure and distributed data processing frameworks (e.g., AWS, Spark, Kafka, Kubernetes).
- Excellent problem-solving abilities, attention to detail, and the capability to rapidly address technical challenges.
- Strong communication and collaboration skills with cross-functional teams.
Preferred
- Proven track record of supporting NLP or AI research teams with rapid and reliable data delivery.
- Experience with refining outputs from large-scale AI models, such as LLM-generated data.
- Contributions to open-source projects, coding competitions, or high visibility in coding communities (e.g., GitHub, Stack Overflow).
- Familiarity with the latest advancements in NLP data processing and large language model technologies.