Prepare WMT'17 Datasets · google/seq2seq#21

3 comments, 0 reactions, 0 assignees. Language: Python. Forks: 1,329.

Labels: data, help wanted


We should prepare datasets for all WMT'17 language pairs. This is also a chance to try out google/sentencepiece as a preprocessor.

Each dataset should come in several configurations: different vocabulary sizes, plus a character-level version.
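One way to enumerate these configurations is sketched below. It builds one set of training arguments per vocabulary size plus a character-level entry, then hands each to SentencePiece. The vocabulary sizes, the `wmt17.<pair>.<type>.<size>` naming scheme, the file names, and the choice of the `unigram` model type are all assumptions for illustration; the issue does not specify any of them.

```python
# Hypothetical defaults; the issue does not name specific sizes.
VOCAB_SIZES = (8000, 16000, 32000)


def build_configs(train_file, lang_pair, vocab_sizes=VOCAB_SIZES,
                  model_type="unigram"):
    """Return one keyword-argument dict per dataset configuration."""
    configs = []
    for size in vocab_sizes:
        configs.append({
            "input": train_file,
            # Naming scheme is an assumption, e.g. wmt17.de-en.unigram.8000
            "model_prefix": f"wmt17.{lang_pair}.{model_type}.{size}",
            "vocab_size": size,
            "model_type": model_type,
        })
    # Character-level version: the vocabulary is the set of characters,
    # so vocab_size is treated as a soft limit instead of a hard one.
    configs.append({
        "input": train_file,
        "model_prefix": f"wmt17.{lang_pair}.char",
        "model_type": "char",
        "hard_vocab_limit": False,
    })
    return configs


def train_all(configs):
    """Train one SentencePiece model per configuration."""
    # Imported lazily so the configs can be inspected without the
    # third-party package installed (pip install sentencepiece).
    import sentencepiece as spm
    for cfg in configs:
        spm.SentencePieceTrainer.train(**cfg)


if __name__ == "__main__":
    # "train.de-en.txt" is a placeholder path, not a file from the issue.
    for cfg in build_configs("train.de-en.txt", "de-en"):
        print(cfg["model_prefix"])
```

Keeping the configuration list separate from the training call makes it easy to review which variants will be produced for a language pair before any model is trained.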

Together with the raw data files, we also need the script that was used to process them.
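A minimal sketch of one way to ship the script alongside the data: copy the preprocessing script into the output directory and write a small manifest recording the configuration and a checksum per data file. The `MANIFEST.json` name, the manifest layout, and the `write_manifest` helper are hypothetical and not part of the issue or of seq2seq.

```python
import hashlib
import json
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(output_dir, script_path, config, data_files):
    """Copy the script next to the data and record how it was run.

    Assumes script_path lives outside output_dir.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    script_name = Path(script_path).name
    shutil.copy(script_path, out / script_name)
    manifest = {
        "script": script_name,
        "config": config,
        "files": {Path(p).name: sha256_of(p) for p in data_files},
    }
    manifest_path = out / "MANIFEST.json"  # hypothetical file name
    manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest_path
```

With the checksums stored next to the script, anyone rerunning the preprocessing can check that their output matches the published files.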

Research direction: Investigate the WMT'17 dataset links, implement tokenization with sentencepiece, and create preprocessing scripts for each language pair with configurable vocabulary sizes and character-level versions.
Tech stack: Python, TensorFlow
Domain: backend, data, machine learning
Issue type: Feature
Difficulty: 2
Estimated time: Half a day
Activity status: Old
Clarity: Clear
Prerequisites: Python, TensorFlow
Newcomer friendliness: 75