huggingface/datasets

Return the name of the currently loaded file in the load_dataset function.

Open

#5806, created on April 28, 2023

Labels: enhancement, good first issue


Description

Feature request

Add an optional parameter, return_file_name, to the load_dataset function. When it is set to True, each returned example includes the name of the file it was read from as an additional feature.
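
To make the requested behavior concrete, here is a minimal sketch that uses only the Python standard library, not the datasets API. The return_file_name parameter does not exist in load_dataset as far as this request states; the helper iter_jsonl_with_file_name and the column name file_name are hypothetical names chosen for illustration.

```python
import json
import os
import tempfile
from typing import Dict, Iterable, Iterator


def iter_jsonl_with_file_name(
    paths: Iterable[str], column: str = "file_name"
) -> Iterator[Dict]:
    """Yield every JSON Lines record, tagged with the name of its source file.

    Hypothetical helper: it mimics what the proposed flag might return.
    """
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines between records
                record = json.loads(line)
                record[column] = os.path.basename(path)
                yield record


def write_jsonl(path: str, records: Iterable[Dict]) -> None:
    """Write records to `path`, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


# Build two tiny shards in a temporary directory and read them back.
tmp_dir = tempfile.mkdtemp()
shard_a = os.path.join(tmp_dir, "shard-a.jsonl")
shard_b = os.path.join(tmp_dir, "shard-b.jsonl")
write_jsonl(shard_a, [{"text": "first"}, {"text": "second"}])
write_jsonl(shard_b, [{"text": "third"}])

rows = list(iter_jsonl_with_file_name([shard_a, shard_b]))
for row in rows:
    print(row)
```

Each record keeps its original fields and gains one extra field identifying the shard, which is the shape of output the request asks load_dataset to produce.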

Motivation

When training large language models, hardware or system failures may interrupt the training process. In such cases, it is common to load a previously saved checkpoint and resume training. I would like to be able to obtain the names of the data shards that have already been consumed, so that I can skip them when training continues. This avoids overfitting on repeated data and wasting training time.
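
The resume workflow described above could be sketched as follows, again with the standard library only. The function names, shard names, and the progress file format are assumptions made for this example; they are not part of datasets. The sketch records a shard only after it has been fully consumed, so a partially read shard would be replayed from its start on resume.

```python
import json
import os
import tempfile
from typing import Iterable, List, Set


def save_progress(path: str, finished: Set[str]) -> None:
    """Persist the names of fully consumed shards next to the checkpoint."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sorted(finished), f)


def load_progress(path: str) -> Set[str]:
    """Return the set of finished shard names, or an empty set on a fresh run."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return set(json.load(f))


def remaining_shards(all_shards: Iterable[str], finished: Set[str]) -> List[str]:
    """Keep the original shard order, dropping shards that were already used."""
    return [shard for shard in all_shards if shard not in finished]


all_shards = ["shard-00000.jsonl", "shard-00001.jsonl", "shard-00002.jsonl"]
progress_file = os.path.join(tempfile.mkdtemp(), "progress.json")

# First run: two shards are consumed before the job is interrupted.
finished = load_progress(progress_file)
for shard in all_shards[:2]:
    finished.add(shard)
    save_progress(progress_file, finished)

# Resumed run: reload the progress and skip what was already trained on.
todo = remaining_shards(all_shards, load_progress(progress_file))
print(todo)
```

This bookkeeping only works if the training loop can tell which file each example came from, which is what the requested feature would provide.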

Your contribution

I currently use a dataset in jsonl format, so I am primarily interested in the JSON loader. I suggest adding the file name to the returned table here: https://github.com/huggingface/datasets/blob/main/src/datasets/packaged_modules/json/json.py#L92.
