Story: Serializing datasets · pytorch/text#140

Hi Torchtext,

It would be great to have a story for saving datasets. Things are currently not in a great place, and I would like to know where it might head.

Things are not serializable. In OpenNMT-py, we are hacking around this issue by serializing Dataset/Field objects. This doesn't really work out of the box because of the use of defaultdict. However, we can get around that issue by monkeypatching the __getstate__ of Vocab. Maybe this could be built in.
Datasets take a ton of memory. I like that datasets are so clean, but their internal storage is not cheap, storing the field names as strings along with all of the string data itself. It's cute that conversion/batching happens on the fly, but it might be nice to be able to turn that off, i.e. convert to tensors if you want.
It requires loading everything into memory. Dataset objects are currently monolithic. They assume that the universe is stored directly in them. Ideally, it should be possible to store datasets on disk as shards. It would be great if the loading and use of these shards could be invisible to the user.
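
For reference, the defaultdict workaround from the first point can be sketched roughly as follows. This is a minimal illustration, not the actual torchtext or OpenNMT-py code: ToyVocab, its stoi/itos attributes, and patch_for_pickling are hypothetical stand-ins, and the sketch assumes the pickling failure comes from a lambda used as the defaultdict's default factory.

```python
import pickle
from collections import defaultdict


class ToyVocab:
    """Hypothetical stand-in for a vocabulary backed by a defaultdict."""

    def __init__(self, tokens):
        self.itos = ["<unk>"] + sorted(set(tokens))
        # Assumption: the default factory is a lambda, which pickle cannot serialize.
        self.stoi = defaultdict(lambda: 0)
        self.stoi.update({tok: i for i, tok in enumerate(self.itos)})


def _getstate(self):
    # Replace the defaultdict with a plain dict so the state becomes picklable.
    return dict(self.__dict__, stoi=dict(self.stoi))


def _setstate(self, state):
    # Restore the attributes, then rebuild the defaultdict with the same default.
    self.__dict__.update(state)
    self.stoi = defaultdict(lambda: 0, self.stoi)


def patch_for_pickling(cls):
    """Monkeypatch the pickling hooks onto an existing class (hypothetical helper)."""
    cls.__getstate__ = _getstate
    cls.__setstate__ = _setstate
    return cls


if __name__ == "__main__":
    patch_for_pickling(ToyVocab)
    vocab = ToyVocab(["hello", "world"])
    restored = pickle.loads(pickle.dumps(vocab))
    print(restored.itos, restored.stoi["hello"], restored.stoi["unseen"])
```

Having something like these two hooks defined on Vocab itself would remove the need for each downstream project to patch it.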

Thanks guys. As always great work. Cheers! Sasha