Prepare WMT'17 Datasets · google/seq2seq#21 | Good First Issue

(3 comments) (0 reactions) (0 assignees) Python (1,329 forks)

data, help wanted

Repository metrics

Stars: (5,587 stars)
PR merge metrics: (no PRs merged in the last 30 days)

Description

We should prepare datasets for all WMT'17 language pairs. This is also a chance to try out google/sentencepiece as a preprocessor.

Each dataset should come in several configurations, i.e. different vocabulary sizes, and should also have a character-level version.

Together with the raw data files, we also need the script that was used to process them.
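As a rough illustration of what such a script might look like, the sketch below builds one sentencepiece training configuration per vocabulary size plus a character-level one. The function names (`build_configs`, `train_all`), the default vocabulary sizes, the choice of BPE as the subword model, and the output file naming are all assumptions made for this example and are not part of the issue. The character-level entry leaves `vocab_size` unset, which is also an assumption that may need adjusting for a given corpus.

```python
from pathlib import Path


def build_configs(corpus, lang_pair, vocab_sizes=(8000, 16000, 32000), out_dir="models"):
    """Return one keyword-argument dict per preprocessing configuration.

    All names and defaults here are hypothetical, chosen for illustration.
    """
    out = Path(out_dir)
    configs = []
    # One subword model per requested vocabulary size.
    for size in vocab_sizes:
        configs.append({
            "input": str(corpus),
            "model_prefix": str(out / f"{lang_pair}.bpe.{size}"),
            "model_type": "bpe",  # assumption: BPE; "unigram" is another option
            "vocab_size": size,
        })
    # One character-level model per language pair.
    configs.append({
        "input": str(corpus),
        "model_prefix": str(out / f"{lang_pair}.char"),
        "model_type": "char",
    })
    return configs


def train_all(configs):
    """Train one sentencepiece model per configuration."""
    # Imported lazily so the configurations can be inspected
    # without sentencepiece installed.
    import sentencepiece as spm

    for cfg in configs:
        Path(cfg["model_prefix"]).parent.mkdir(parents=True, exist_ok=True)
        spm.SentencePieceTrainer.train(**cfg)
```

A call such as `train_all(build_configs("train.de-en.txt", "de-en"))` would then train every configuration for one language pair, where `train.de-en.txt` is a placeholder for a raw text file with one sentence per line. Keeping the configuration list separate from the training step makes it easy to record exactly which settings produced each released file.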

Contributor guide

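Approach: Look up the WMT'17 dataset links, implement sentencepiece tokenization, and write a data preprocessing script that supports configurable vocabulary sizes and a character-level version for each language pair.
Tech stack: python, tensorflow
Area: backend, data, machine learning
Issue type: Feature
Difficulty: 2
Estimated time: Half a day
Activity: Stale
Clarity: Clear
Prerequisites: Python, TensorFlow
Beginner friendliness: 75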