Application Open:
Full-Time
Job Purpose:
As a Deep Learning Data Engineer, your role will be instrumental in building and maintaining our data infrastructure, with a focus on handling large-scale data for training and inference of deep learning models. You will be responsible for data crawling, cleaning, and transforming raw data into formats suitable for training complex deep learning models. Your expertise in big data platforms like MapReduce, Hadoop, Spark, and Kubernetes will be crucial in efficiently processing and managing data for large deep learning tasks.
Job Responsibilities:
- Data Crawling and Collection: Develop and implement advanced data crawling strategies to acquire vast amounts of structured and unstructured data from diverse sources, including websites, APIs, and databases.
- Data Cleaning and Preprocessing: Apply sophisticated data cleaning techniques to handle missing or inconsistent data, ensuring high-quality data for training large deep learning models.
- Data Transformation: Design and implement data transformation pipelines optimized for processing and preparing data for training complex deep learning models.
- Big Data Processing: Utilize your proficiency in big data platforms such as MapReduce, Hadoop, and Spark to efficiently process and analyze the large-scale datasets required for training large deep learning models.
- Database Management: Establish and manage databases tailored to store and access large volumes of processed data, ensuring data security, reliability, and efficient data retrieval.
- ETL (Extract, Transform, Load): Develop and maintain ETL workflows that effectively extract data from diverse sources, transform it to meet deep learning model requirements, and load it into data warehouses or databases.
- Performance Optimization: Optimize data processing workflows and algorithms to achieve superior performance for training and inference of large deep learning models.
- Data Modeling for Deep Learning: Collaborate closely with Data Scientists and Deep Learning Researchers to understand data requirements and design appropriate data models that cater to the needs of large and complex deep learning tasks.
- Data Governance: Implement robust data governance practices to ensure data accuracy, security, and compliance with data regulations, especially when working with sensitive data.
- Big Data Platform Management: Manage and configure big data platforms to ensure their stability, scalability, and seamless integration with deep learning workflows.
- Documentation: Document data engineering processes, data flows, and data models specific to large deep learning tasks, enabling knowledge sharing and future reference.
Requirements:
- Bachelor's or Master's degree in Computer Engineering, Computer Science, or Electrical Engineering and Computer Science.
- 6+ years of programming experience, with solid coding skills in Python, Shell, and Java.
- Strong collaboration and communication skills.
- Expertise in big data platforms such as MapReduce, Hadoop, and Spark.
- Degree in Computer Science, Engineering, Statistics or a related field
- 4+ years of relevant experience as a Data Engineer in the data and analytics domain.
- Experience with solution architecture, data ingestion, query optimization, data segregation, ETL/ELT, AWS services (EC2, S3, SQS, Lambda, Elasticsearch, Redshift), and CI/CD frameworks and workflows.
- Working knowledge of data platform concepts: data lake, data warehouse, ETL, big data processing (designing for and supporting variety, velocity, and volume), real-time processing architectures for data platforms, and scheduling and monitoring of ETL/ELT jobs.
- Experience with relational databases such as PostgreSQL and with programming (preferably Java or Python); proficiency in understanding data, entity relationships, structured and unstructured data, and SQL and NoSQL databases.
- Knowledge of best practices for optimizing columnar and distributed data processing systems and infrastructure.
- Experienced in designing and implementing dimensional modelling.
- Knowledge of machine learning and data mining techniques in one or more areas such as statistical modelling, text mining, and information retrieval.