Prepare WMT'17 Datasets · google/seq2seq#21 | Good First Issue

(3 comments) (0 reactions) (0 assignees) Python (1,329 forks)

Labels: data, help wanted

Repository metrics

Stars: 5,587
PR merge metrics: no PRs merged in the last 30 days

Description

We should prepare datasets for all WMT'17 language pairs. This is also a chance to try out google/sentencepiece as a preprocessor.

Each dataset should come in several configurations: different vocabulary sizes, plus a character-level version.
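
As a rough illustration of what "configurations" could mean here, the sketch below enumerates one subword configuration per vocabulary size plus one character-level configuration for each language pair, and shows a simple reversible character-level split. The language pairs, vocabulary sizes, the `▁` space marker, and all function names are assumptions made for the example; the issue does not specify any of them.

```python
# Illustrative values only; the issue does not list pairs or vocabulary sizes.
LANGUAGE_PAIRS = [("en", "de"), ("en", "cs")]
VOCAB_SIZES = [8000, 16000, 32000]


def dataset_configs(pairs, vocab_sizes):
    """Return one subword config per (pair, size) and one char-level config per pair."""
    configs = []
    for src, tgt in pairs:
        pair = f"{src}-{tgt}"
        for size in vocab_sizes:
            configs.append({"pair": pair, "unit": "subword", "vocab_size": size})
        # The character-level version has no fixed vocabulary size in this sketch.
        configs.append({"pair": pair, "unit": "char", "vocab_size": None})
    return configs


def char_tokenize(line, space_symbol="▁"):
    """Split a sentence into characters, replacing spaces with a visible marker."""
    return [space_symbol if ch == " " else ch for ch in line.strip()]


def char_detokenize(tokens, space_symbol="▁"):
    """Invert char_tokenize by turning the marker back into spaces."""
    return "".join(" " if tok == space_symbol else tok for tok in tokens)


if __name__ == "__main__":
    for cfg in dataset_configs(LANGUAGE_PAIRS, VOCAB_SIZES):
        print(cfg)
    print(" ".join(char_tokenize("Guten Tag")))
```

Marking spaces explicitly keeps the character-level output reversible, so the original sentence can be recovered after decoding.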

Alongside the raw data files, we also need the script that was used to process them.
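
One possible way to keep the processing reproducible is to generate the SentencePiece training commands from code and save them as a shell script next to the data. The sketch below only builds the `spm_train` command lines (using its `--input`, `--model_prefix`, `--vocab_size`, and `--model_type` flags) and does not run them. The file names, model prefixes, and helper names are hypothetical, not taken from the repository.

```python
import shlex

# Model types accepted by spm_train.
MODEL_TYPES = {"unigram", "bpe", "char", "word"}


def spm_train_command(corpus_path, model_prefix, vocab_size=None, model_type="unigram"):
    """Build the argument list for one spm_train invocation (nothing is executed)."""
    if model_type not in MODEL_TYPES:
        raise ValueError(f"unknown model_type: {model_type!r}")
    args = [
        "spm_train",
        f"--input={corpus_path}",
        f"--model_prefix={model_prefix}",
        f"--model_type={model_type}",
    ]
    # When vocab_size is None the flag is omitted and the tool's default applies.
    if vocab_size is not None:
        args.append(f"--vocab_size={vocab_size}")
    return args


def render_script(commands):
    """Render the commands as a shell script that can be stored with the data."""
    lines = ["#!/bin/sh", "set -e"]
    lines.extend(shlex.join(cmd) for cmd in commands)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical corpus path and prefixes, for illustration only.
    commands = [
        spm_train_command("train.en-de.txt", "wmt17.en-de.bpe32k", 32000, "bpe"),
        spm_train_command("train.en-de.txt", "wmt17.en-de.char", model_type="char"),
    ]
    print(render_script(commands), end="")
```

Storing the rendered script alongside the output files records exactly which settings produced each configuration.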

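Contributor guide

Research direction: Survey the WMT'17 dataset links, implement sentencepiece tokenization, and create a data preprocessing script for each language pair with configurable vocabulary sizes and a character-level version.
Tech stack: Python, TensorFlow
Domain: backend, data, machine learning
Issue type: feature
Difficulty: 2
Estimated time: half a day
Activity status: not updated in a long time
Clarity: clear
Prerequisites: Python, TensorFlow
Beginner friendliness: 75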