pytorch/text

Story: Serializing datasets

Open

#140 opened on October 10, 2017

Labels: enhancement, help wanted

Description

Hi Torchtext,

It would be great to have a story for saving datasets. Things are currently not in a great place, and I would like to know where it might head.

  1. Things are not serializable. In OpenNMT-py, we are hacking around this issue by serializing Dataset/Field objects. This doesn't really work out of the box because of the use of defaultdict. However, we can get around that issue by monkeypatching the __getstate__ of Vocab. Maybe this could be built in.

  2. Datasets take a ton of memory. I like that datasets are so clean, but their internal storage is not cheap, storing the field names as strings along with all of the string data itself. It's cute that conversion/batching happens on the fly, but it might be nice to be able to turn that off, i.e. convert to tensors if you want.

  3. It requires loading everything into memory. Dataset objects are currently monolithic. They assume that the universe is stored directly in them. Ideally, datasets could be stored on disk as shards. It would be great if the loading and use of these shards could be invisible to the user.
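
To make point 1 concrete, here is a minimal sketch of the kind of monkeypatch I mean. `ToyVocab` is a hypothetical stand-in, not torchtext's actual Vocab, and the lambda default factory is my assumption about why pickling the defaultdict fails. The helper names `_getstate` and `_setstate` are also just illustrative.

```python
import pickle
from collections import defaultdict


class ToyVocab:
    """Hypothetical stand-in for a vocabulary class; not torchtext's Vocab."""

    def __init__(self, tokens):
        self.itos = ["<unk>"] + sorted(set(tokens))
        # Assumption: the default factory is a lambda, which pickle cannot serialize.
        self.stoi = defaultdict(lambda: 0)
        self.stoi.update({tok: i for i, tok in enumerate(self.itos)})


def _getstate(self):
    # Swap the defaultdict for a plain dict so pickle never sees the lambda.
    return dict(self.__dict__, stoi=dict(self.stoi))


def _setstate(self, state):
    self.__dict__.update(state)
    # Rebuild the defaultdict so unknown tokens map to index 0 again.
    self.stoi = defaultdict(lambda: 0, self.stoi)


# Monkeypatch the class instead of editing its source.
ToyVocab.__getstate__ = _getstate
ToyVocab.__setstate__ = _setstate

vocab = ToyVocab(["hello", "world"])
restored = pickle.loads(pickle.dumps(vocab))
```

After the round trip, `restored.stoi` is a defaultdict again, so lookups of unseen tokens still fall back to the unknown index. Having something like this built in would remove the need for every downstream project to patch it themselves.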

Thanks guys. As always great work. Cheers! Sasha
