Story: Serializing datasets · pytorch/text#140

Hi Torchtext,

It would be great to have a story for saving datasets. Things are currently not in a great place, and I would like to know where this might be headed.

Things are not serializable. In OpenNMT-py, we are hacking around this issue by serializing Dataset/Field objects. This doesn't really work out of the box because of the usage of defaultdict. However, we can get around that issue by monkeypatching the __getstate__ of Vocab. Maybe this could be built in.
Datasets take a ton of memory. I like that datasets are so clean, but their internal storage is not cheap, storing the field names as strings along with all of the string data itself. It's cute that conversion/batching happens on the fly, but it might be nice to be able to turn that off, i.e. convert to tensors if you want.
It requires loading everything into memory. Dataset objects are currently monolithic. They assume that the universe is stored directly in them. Ideally, datasets could be stored on disk as shards. It would be great if the loading and usage of these shards could be invisible to the user.
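
To make the first point concrete, here is a minimal sketch of the __getstate__ workaround, assuming the pickling problem comes from a defaultdict whose default factory is a lambda. ToyVocab is a hypothetical stand-in written for illustration; it is not torchtext's Vocab, and the attribute names itos and stoi are only assumed to mirror it.

```python
import pickle
from collections import defaultdict


class ToyVocab:
    """Hypothetical stand-in for a vocabulary; not torchtext's actual Vocab."""

    def __init__(self, tokens):
        self.itos = ["<unk>"] + sorted(set(tokens))
        # A lambda default_factory is what makes plain pickling fail.
        self.stoi = defaultdict(lambda: 0)
        self.stoi.update({tok: i for i, tok in enumerate(self.itos)})

    def __getstate__(self):
        # Swap the defaultdict for a plain dict so pickle never sees the lambda.
        state = dict(self.__dict__)
        state["stoi"] = dict(self.stoi)
        return state

    def __setstate__(self, state):
        # Rebuild the defaultdict (unknown tokens map to index 0) after loading.
        self.__dict__.update(state)
        stoi = defaultdict(lambda: 0)
        stoi.update(state["stoi"])
        self.stoi = stoi


vocab = ToyVocab(["hello", "world"])
restored = pickle.loads(pickle.dumps(vocab))
```

Defining the two methods on the class has the same effect as monkeypatching them from outside, but it keeps the fix next to the data it protects. A module-level function used as the default factory would probably avoid the problem as well, since pickle can reference such functions by name.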

Thanks guys. As always great work. Cheers! Sasha

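Investigation approach: Investigate how to serialize Dataset/Field objects, how to reduce memory usage by converting strings to tensors at load time, and how to implement lazy loading or sharding for large datasets.
Tech stack: Python
Area: data, machine learning
Issue type: feature
Difficulty: 3
Estimated time: 1-2 days
Activity: active
Clarity: mostly clear
Prerequisites: Python, PyTorch
Beginner-friendliness: 30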
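
As a rough illustration of the sharding and lazy-loading direction, the sketch below writes examples to disk in fixed-size shards and reads them back one shard at a time. It uses only the standard library; the helper names write_shards and iter_examples and the shard-NNNNN.pkl file naming are assumptions made for this example, not part of torchtext.

```python
import pickle
import tempfile
from pathlib import Path


def write_shards(examples, directory, shard_size):
    """Pickle `examples` into files of at most `shard_size` items each."""
    directory = Path(directory)
    paths = []
    for start in range(0, len(examples), shard_size):
        path = directory / f"shard-{start // shard_size:05d}.pkl"
        with path.open("wb") as f:
            pickle.dump(examples[start:start + shard_size], f)
        paths.append(path)
    return paths


def iter_examples(directory):
    """Yield examples shard by shard; only one shard is loaded at a time."""
    for path in sorted(Path(directory).glob("shard-*.pkl")):
        with path.open("rb") as f:
            shard = pickle.load(f)
        yield from shard


# Examples are stored as lists of integer token ids rather than strings,
# which is one way to do the conversion up front instead of on the fly.
examples = [[i, i + 1, i + 2] for i in range(10)]

with tempfile.TemporaryDirectory() as tmp:
    shard_paths = write_shards(examples, tmp, shard_size=4)
    num_shards = len(shard_paths)
    loaded = list(iter_examples(tmp))
```

Because iter_examples is a generator, a caller that consumes it lazily never holds more than one shard in memory, which is the behavior the issue asks to keep invisible to the user. Zero-padded shard names keep the lexicographic sort in write order.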