pytorch/text

Story: Serializing datasets

Open

#140 created on October 10, 2017

(16 comments) (18 reactions) (0 assignees) Python (3,396 stars) (822 forks) batch import
enhancement, help wanted

Description

Hi Torchtext,

It would be great to have a story for saving datasets. Things are currently not in a great place, and I would like to know where it might head.

  1. Things are not serializable. In OpenNMT-py, we are hacking around this issue by serializing Dataset/Field objects. This doesn't really work out of the box because of the use of defaultdict. However, we can get around that issue by monkeypatching the __getstate__ of Vocab. Maybe this could be built in.

  2. Datasets take a ton of memory. I like that datasets are so clean, but their internal storage is not cheap: they store the field names as strings along with all of the string data itself. It's cute that conversion/batching happens on the fly, but it might be nice to be able to turn that off, i.e. convert to tensors if you want.

  3. It requires loading everything into memory. Dataset objects are currently monolithic. They assume that the universe is stored directly in them. Ideally, datasets could be stored on disk as shards. It would be great if the loading and use of these shards could be invisible to the user.
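
To make point 1 concrete, here is a rough sketch of the kind of workaround I mean. `ToyVocab` is a made-up stand-in, not torchtext's actual `Vocab`, and I'm assuming the problem is a `defaultdict` whose default factory is a lambda, which `pickle` cannot serialize. The helper names (`_getstate`, `_setstate`, `can_pickle`) are also just for illustration.

```python
import pickle
from collections import defaultdict


class ToyVocab:
    """Hypothetical stand-in for a vocabulary object; not torchtext's Vocab."""

    def __init__(self, tokens):
        # Unknown tokens map to index 0. The lambda is what breaks pickling.
        self.stoi = defaultdict(lambda: 0)
        self.itos = ["<unk>"] + list(tokens)
        for index, token in enumerate(self.itos):
            self.stoi[token] = index


def _getstate(self):
    # Swap the defaultdict for a plain dict so pickle never sees the lambda.
    state = dict(self.__dict__)
    state["stoi"] = dict(self.stoi)
    return state


def _setstate(self, state):
    # Restore the attributes, then rebuild the defaultdict with its default.
    self.__dict__.update(state)
    self.stoi = defaultdict(lambda: 0, self.stoi)


def can_pickle(obj):
    """Return True if pickle.dumps succeeds on obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


vocab = ToyVocab(["hello", "world"])
pickled_before_patch = can_pickle(vocab)  # fails because of the lambda

# Monkeypatch the class from the outside, as described above.
ToyVocab.__getstate__ = _getstate
ToyVocab.__setstate__ = _setstate

restored = pickle.loads(pickle.dumps(vocab))
```

After the patch the object round-trips and still falls back to index 0 for unseen tokens. If something like this lived inside the library, nobody would need to patch it from the outside.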

Thanks guys. As always great work. Cheers! Sasha
