google/seq2seq

Prepare WMT'17 Datasets

Open

#21 opened on March 11, 2017

3 comments, 0 reactions, 0 assignees, Python, 5,587 stars, 1,329 forks
Labels: batch import, data, help wanted

Description

We should prepare datasets for all WMT'17 language pairs. This is also a chance to try out google/sentencepiece as a preprocessor.

Each dataset should come in several configurations: different vocabulary sizes, plus a character-level version.

Together with the raw data files, we also need the script that was used to process them.
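
As a rough sketch of what such a script might enumerate, the snippet below builds one `spm_train` command line per language pair and configuration. The language pairs, vocabulary sizes, directory layout, and the choice of BPE for the subword runs are all assumptions made for illustration; none of them are specified in this issue. Only the `--input`, `--model_prefix`, `--model_type`, and `--vocab_size` flags of sentencepiece's `spm_train` tool are relied on.

```python
# Illustrative values only; the issue does not fix pairs, sizes, or paths.
LANGUAGE_PAIRS = ["de-en", "zh-en"]
VOCAB_SIZES = [8000, 16000, 32000]


def spm_train_args(pair, model_type, vocab_size=None):
    """Build the argument list for a single `spm_train` run.

    The data/ and models/ paths are a hypothetical layout.
    """
    suffix = model_type if vocab_size is None else f"{model_type}{vocab_size}"
    args = [
        "spm_train",
        f"--input=data/{pair}/train.txt",
        f"--model_prefix=models/{pair}.{suffix}",
        f"--model_type={model_type}",
    ]
    if vocab_size is not None:
        args.append(f"--vocab_size={vocab_size}")
    return args


def all_configurations():
    """One subword run per vocabulary size, plus one character-level run, per pair."""
    configs = []
    for pair in LANGUAGE_PAIRS:
        for size in VOCAB_SIZES:
            configs.append(spm_train_args(pair, "bpe", size))
        # Character-level version: no explicit vocabulary size is passed here
        # (an assumption; adjust if the tool requires one for your data).
        configs.append(spm_train_args(pair, "char"))
    return configs


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed
    # and checked in alongside the generated data.
    for args in all_configurations():
        print(" ".join(args))
```

Printing the commands rather than executing them keeps the sketch free of dependencies and leaves a record of exactly how each configuration was produced.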
