huggingface/datasets

Return the name of the currently loaded file in the load_dataset function.

Open: #5806, opened on Apr 28, 2023

Labels: enhancement, good first issue

Description

Feature request

Add an optional parameter return_file_name to the load_dataset function. When it is set to True, each returned example would include, as an additional feature, the name of the file that the example was read from.
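To illustrate the requested behavior, here is a minimal standard-library sketch, not the datasets implementation. The helper name iter_jsonl_with_file_name and the column name file_name are hypothetical and chosen only for this example; the real feature would live inside load_dataset.

```python
import json
import os
import tempfile


def iter_jsonl_with_file_name(paths, column="file_name"):
    """Yield each JSON Lines record with the name of its source file attached.

    Hypothetical helper: it only mimics what return_file_name=True might do.
    """
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                record = json.loads(line)
                record[column] = os.path.basename(path)
                yield record


# Small demonstration with two temporary shards.
tmp_dir = tempfile.mkdtemp()
shard_paths = []
for i in range(2):
    path = os.path.join(tmp_dir, f"shard-{i}.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"text": f"example {i}"}) + "\n")
    shard_paths.append(path)

rows = list(iter_jsonl_with_file_name(shard_paths))
```

Each element of rows is the original record plus a file_name entry, which is the shape of output this request has in mind.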

Motivation

When training large language models, hardware failures may interrupt the training process. In such cases, it is common to load a previously saved checkpoint and resume training. I would like to be able to obtain the names of the data shards that have already been trained on, so that I can skip them when training resumes. This avoids overfitting on repeated data and wasting training time.
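The resume logic this would enable can be sketched as follows. This is an assumption-laden illustration: the shard names are made up, and remaining_shards is a hypothetical helper, not part of datasets. It assumes the set of finished shards was saved alongside the checkpoint.

```python
def remaining_shards(all_shards, finished_shards):
    """Return the shards not yet trained on, preserving the original order."""
    finished = set(finished_shards)
    return [shard for shard in all_shards if shard not in finished]


# Hypothetical shard names, for illustration only.
all_shards = ["shard-0.jsonl", "shard-1.jsonl", "shard-2.jsonl", "shard-3.jsonl"]
finished = ["shard-0.jsonl", "shard-1.jsonl"]  # recorded at checkpoint time

to_train = remaining_shards(all_shards, finished)
# The resulting list could then be passed as data_files when reloading,
# e.g. load_dataset("json", data_files=to_train).
```

Recording the finished shards requires knowing which file each example came from, which is what the proposed parameter would expose.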

Your contribution

My dataset is in jsonl format, so I am primarily interested in the json loader. I suggest adding the file name to the returned table here: https://github.com/huggingface/datasets/blob/main/src/datasets/packaged_modules/json/json.py#L92.
