Prepare WMT'17 Datasets · google/seq2seq#21


Labels: data, help wanted


We should prepare datasets for all WMT'17 language pairs. This is also a chance to try out google/sentencepiece as a preprocessor.

Each dataset should come in several configurations, i.e. with different vocabulary sizes, and should also have a character-level version.

Together with the raw data files, we also need the script that was used to process them.
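
As a rough sketch of what "different configurations" could look like in such a script, the snippet below builds one SentencePiece training configuration per language pair and vocabulary size, plus one character-level configuration per pair. The language pairs, vocabulary sizes, directory layout, the `CHAR_VOCAB_LIMIT` constant, and the helper names `build_configs` and `train_all` are all assumptions made for illustration; none of them come from this issue. `train_all` assumes a recent `sentencepiece` Python package in which `SentencePieceTrainer.train` accepts keyword arguments.

```python
# Hypothetical values for illustration; the issue does not specify them.
LANGUAGE_PAIRS = [("en", "de"), ("en", "cs")]
VOCAB_SIZES = [8000, 16000, 32000]
CHAR_VOCAB_LIMIT = 1000  # assumed upper bound on distinct characters kept


def build_configs(data_dir, language_pairs=LANGUAGE_PAIRS, vocab_sizes=VOCAB_SIZES):
    """Return one SentencePiece training config (a dict) per dataset variant."""
    configs = []
    for src, tgt in language_pairs:
        pair = f"{src}-{tgt}"
        # Assumed file layout: <data_dir>/<pair>/train.<lang>
        inputs = ",".join(f"{data_dir}/{pair}/train.{lang}" for lang in (src, tgt))

        # One subword model per requested vocabulary size.
        for size in vocab_sizes:
            configs.append({
                "input": inputs,
                "model_prefix": f"{data_dir}/{pair}/spm.{size}",
                "model_type": "unigram",
                "vocab_size": size,
            })

        # One character-level model per language pair.
        configs.append({
            "input": inputs,
            "model_prefix": f"{data_dir}/{pair}/spm.char",
            "model_type": "char",
            "vocab_size": CHAR_VOCAB_LIMIT,
            # Assumption: relax the exact-size check, since the corpus may
            # contain fewer distinct characters than CHAR_VOCAB_LIMIT.
            "hard_vocab_limit": False,
        })
    return configs


def train_all(configs):
    """Train every configured model (requires the sentencepiece package)."""
    import sentencepiece as spm  # imported lazily so build_configs has no dependency

    for cfg in configs:
        spm.SentencePieceTrainer.train(**cfg)


if __name__ == "__main__":
    for cfg in build_configs("data/wmt17"):
        print(cfg["model_prefix"], cfg["model_type"], cfg["vocab_size"])
```

Keeping the configuration list separate from the training call means the same list can be checked into the repository next to the raw data, so each generated dataset can be traced back to the exact settings that produced it.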

Research direction: Investigate WMT'17 dataset links, implement sentencepiece tokenization, and create data preprocessing scripts for each language pair with configurable vocabulary sizes and character-level versions.
Tech stack: Python, TensorFlow
Domain: backend, data, machine learning
Issue type: Feature
Difficulty: 2
Estimated time: Half day
Activity status: Stale
Clarity: Clear
Prerequisites: Python, TensorFlow
Newbie friendliness: 75