NLP Data Engineer

Vacancy Overview

Application Open:

Full-Time

Job Purpose:

MBZUAI requires a Data Engineer specializing in Natural Language Processing (NLP) and large-scale data processing to quickly and effectively gather, curate, and prepare high-quality datasets to support cutting-edge NLP research. The role will be instrumental in enabling researchers by delivering essential data through efficient and scalable engineering practices, including web crawling, LLM-generated content refinement, and robust data pipelines, primarily leveraging Python and related technologies.

Key Responsibilities:

  • Rapidly collect, curate, and preprocess datasets based on detailed specifications provided by NLP researchers, delivering data within tight timelines (typically within 1-2 days).
  • Develop and maintain efficient web crawling solutions, APIs, and automated workflows to continuously improve data collection processes.
  • Refine and evaluate outputs from Large Language Models (LLMs) to generate structured datasets suitable for model training and benchmarking.
  • Implement scalable data pipelines, ensuring efficient data processing, storage, retrieval, and distribution to research teams.
  • Collaborate closely with researchers and engineers to ensure collected data meets specified quality and relevance criteria.
  • Document data collection methodologies, dataset characteristics, and pipeline architecture clearly and effectively.
  • Engage with peer teams and participate in technical reviews to uphold best practices and data quality standards.
  • Represent MBZUAI at industry and research forums, showcasing technical capabilities in large-scale data processing and AI data infrastructure.
  • Perform all other duties as reasonably directed by the line manager commensurate with these functional objectives.

Institutional Strategy:

  • Ensure timely provision of data and reports to management and recommend strategic and operational improvements to support planning, decision-making, and continuous improvement.
  • Analyze monthly and quarterly performance results, identify areas for remediation, and, if required, take prompt corrective action to meet Institutional Function Models and MBZUAI goals.
  • Ensure priorities and plans at every level fully align with divisional and MBZUAI strategic objectives.

Financial & Organizational Coordination:

  • Contribute to accurate demand planning by providing departmental data and information to inform the forecasting of resource requirements.
  • Promote cross-collaboration between Research departments and sections.

  • Adapt to business change and support continuous improvement by identifying opportunities and new requirements, including improvements in IT- and AI-enabled processes, data reporting, and analytics.

Academic Qualifications:

  • Bachelor’s degree in Computer Science, Data Science, Engineering, or a related technical field required.
  • Master’s degree or equivalent experience in Computer Science, Data Engineering, or related technical fields preferred.

Professional Experience:

Essential

  • Extensive experience in data engineering, data processing, and automation using Python.
  • Demonstrated proficiency in designing and deploying web crawling solutions, automated data extraction, and processing pipelines.
  • Strong understanding of data structures, algorithms, databases, SQL, and performance optimization.
  • Experience working with cloud infrastructure and distributed data processing frameworks (e.g., AWS, Spark, Kafka, Kubernetes).
  • Excellent problem-solving abilities, attention to detail, and the capability to rapidly address technical challenges.
  • Strong communication and collaboration skills with cross-functional teams.

Preferred

  • Proven track record of supporting NLP or AI research teams with rapid and reliable data delivery.
  • Experience with refining outputs from large-scale AI models, such as LLM-generated data.
  • Contributions to open-source projects, coding competitions, or high visibility in coding communities (e.g., GitHub, Stack Overflow).
  • Familiarity with the latest advancements in NLP data processing and large language model technologies.
