data, help wanted
Description
We should prepare datasets for all WMT'17 language pairs. This is also a chance to try out google/sentencepiece as a preprocessor.
Each dataset should come in several configurations: different vocabulary sizes, plus a character-level version.
Together with the raw data files, we also need the script that was used to preprocess them.
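
To make the configuration matrix concrete, here is a rough sketch of how the script could enumerate one dataset per language pair and setting. Everything specific in it is an assumption for illustration, not something decided in this issue: the two language pairs, the vocabulary sizes, the `bpe` model type, the output naming scheme, and the helper names `build_configs` and `spm_train_args`. The flags it emits (`--input`, `--model_prefix`, `--model_type`, `--vocab_size`) are the ones accepted by SentencePiece's `spm_train` tool.

```python
# Sketch only: pairs, sizes, model type, and naming are placeholders.
LANGUAGE_PAIRS = ["de-en", "cs-en"]   # assumed subset, extend to all WMT'17 pairs
VOCAB_SIZES = [8000, 16000, 32000]    # assumed sizes, not fixed by this issue


def build_configs(pairs, vocab_sizes):
    """Return one config per (pair, vocab size), plus one char-level config per pair."""
    configs = []
    for pair in pairs:
        for size in vocab_sizes:
            configs.append({
                "name": f"{pair}.bpe.{size}",
                "pair": pair,
                "model_type": "bpe",
                "vocab_size": size,
            })
        # Character-level version: no target vocabulary size is set here.
        configs.append({
            "name": f"{pair}.char",
            "pair": pair,
            "model_type": "char",
            "vocab_size": None,
        })
    return configs


def spm_train_args(config, input_path):
    """Build the spm_train flag list for one config (does not run anything)."""
    args = [
        f"--input={input_path}",
        f"--model_prefix={config['name']}",
        f"--model_type={config['model_type']}",
    ]
    if config["vocab_size"] is not None:
        args.append(f"--vocab_size={config['vocab_size']}")
    return args


if __name__ == "__main__":
    for cfg in build_configs(LANGUAGE_PAIRS, VOCAB_SIZES):
        # "train.<pair>.txt" is a hypothetical path to the raw training text.
        print("spm_train", " ".join(spm_train_args(cfg, f"train.{cfg['pair']}.txt")))
```

Keeping the matrix in one place like this means the published script documents exactly which configurations were generated, which covers the requirement to ship the script alongside the data.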