Prepare WMT'17 Datasets · google/seq2seq#21 | Good First Issue

3 comments · 0 reactions · 0 assignees · Python · 1,329 forks

Labels: data, help wanted

Repository metrics

Stars: 5,587
PR merge metrics: no PRs merged in the last 30 days

Description

We should prepare datasets for all WMT'17 language pairs. This is also a chance to try out google/sentencepiece as a preprocessor.

Each dataset should come in several configurations: different vocabulary sizes, plus a character-level version.

Together with the raw data files, we also need the script that was used to process them.
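One possible shape for such a script is sketched below. It is only a starting point under stated assumptions: the names `PrepConfig`, `build_configs` and `train_model`, the `wmt17.<pair>.<suffix>` prefix scheme, and the choice of BPE for the subword configurations are all hypothetical and not specified in this issue. The only library call used is `sentencepiece.SentencePieceTrainer.train` with keyword arguments, and it is imported lazily so the configuration list can be built without sentencepiece installed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PrepConfig:
    """One preprocessing configuration for one language pair (hypothetical)."""
    pair: str                  # e.g. "de-en"
    model_type: str            # "bpe" for subwords, "char" for character-level
    vocab_size: Optional[int]  # None for the character-level version

    @property
    def model_prefix(self) -> str:
        # Assumed naming scheme; the issue does not prescribe one.
        suffix = "char" if self.model_type == "char" else f"bpe{self.vocab_size}"
        return f"wmt17.{self.pair}.{suffix}"


def build_configs(pairs: List[str], vocab_sizes: List[int]) -> List[PrepConfig]:
    """Enumerate every vocabulary size plus one character-level config per pair."""
    configs = []
    for pair in pairs:
        for size in vocab_sizes:
            configs.append(PrepConfig(pair, "bpe", size))
        configs.append(PrepConfig(pair, "char", None))
    return configs


def train_model(config: PrepConfig, train_files: List[str]) -> None:
    """Train one sentencepiece model (requires `pip install sentencepiece`)."""
    import sentencepiece as spm  # lazy import: enumeration works without it

    kwargs = dict(
        input=",".join(train_files),       # comma-separated list of text files
        model_prefix=config.model_prefix,  # writes <prefix>.model and <prefix>.vocab
        model_type=config.model_type,
    )
    # For the character-level version vocab_size is left at the library default;
    # check the sentencepiece documentation before relying on that behaviour.
    if config.vocab_size is not None:
        kwargs["vocab_size"] = config.vocab_size
    spm.SentencePieceTrainer.train(**kwargs)
```

For example, `build_configs(["de-en"], [8000, 32000])` yields three configurations for German-English: two subword vocabularies and one character-level model. The pair names and vocabulary sizes here are placeholders; the actual list would come from the WMT'17 task data.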

Contributor guide

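Research direction: Investigate the WMT'17 dataset links, implement sentencepiece tokenization, and create a data preprocessing script for each language pair with configurable vocabulary sizes and a character-level version.
Tech stack: Python, TensorFlow
Domain: backend, data, machine learning
Issue type: feature
Difficulty: 2
Estimated time: half a day
Activity status: not updated for a long time
Clarity: clear
Prerequisites: Python, TensorFlow
Beginner friendliness: 75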