google/seq2seq

Prepare WMT'17 Datasets

Open

#21 created on March 11, 2017

(3 comments) (0 reactions) (0 assignees) Python (5,587 stars) (1,329 forks)
batch import, data, help wanted

Description

We should prepare datasets for all WMT'17 language pairs. This is also a chance to try out google/sentencepiece as a preprocessor.

Each dataset should come in several configurations, i.e. with different vocabulary sizes, and should also have a character-level version.

Along with the raw data files, we also need the script that was used to preprocess them.
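
To make the configuration matrix concrete, here is a rough sketch of how a preparation script might enumerate one `spm_train` invocation per language pair and configuration. The language pairs, vocabulary sizes, directory layout, and the helper name `spm_train_commands` are all assumptions for illustration; none of them are specified in this issue. The script only builds command strings and does not run SentencePiece.

```python
# Hypothetical values for illustration only; the issue does not fix them.
LANGUAGE_PAIRS = ["de-en", "ru-en"]
VOCAB_SIZES = [8000, 16000, 32000]


def spm_train_commands(pairs=LANGUAGE_PAIRS, vocab_sizes=VOCAB_SIZES, data_dir="data"):
    """Return one spm_train command string per (language pair, configuration)."""
    commands = []
    for pair in pairs:
        # Assumed layout: one combined training text file per language pair.
        corpus = f"{data_dir}/{pair}/train.txt"

        # Subword configurations: one BPE model per vocabulary size.
        for size in vocab_sizes:
            prefix = f"{data_dir}/{pair}/spm.bpe.{size}"
            commands.append(
                f"spm_train --input={corpus} --model_prefix={prefix} "
                f"--vocab_size={size} --model_type=bpe"
            )

        # Character-level configuration. The vocabulary size is left at the
        # tool's default here and may need adjusting for a given corpus.
        prefix = f"{data_dir}/{pair}/spm.char"
        commands.append(
            f"spm_train --input={corpus} --model_prefix={prefix} --model_type=char"
        )
    return commands


if __name__ == "__main__":
    for command in spm_train_commands():
        print(command)
```

Writing the printed commands to a shell script and shipping that script next to the data would keep a record of how each configuration was produced.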
