huggingface/datasets

Return the name of the currently loaded file in the load_dataset function.

Open

#5806 opened on Apr 28, 2023

enhancement, good first issue


Description

Feature request

Add an optional parameter return_file_name to the load_dataset function. When it is set to True, each returned example includes, as an additional feature, the name of the file it was read from.
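To make the request concrete, here is a minimal standalone sketch of the behavior I have in mind. It uses only the standard library and is not the datasets API: return_file_name does not exist in load_dataset today, and the helper name iter_jsonl_with_file_name and the column name file_name are hypothetical.

```python
import json
import os
from typing import Any, Dict, Iterable, Iterator


def iter_jsonl_with_file_name(
    paths: Iterable[str], column: str = "file_name"
) -> Iterator[Dict[str, Any]]:
    """Yield one dict per non-empty line, tagged with its source file name.

    Hypothetical helper for illustration only; not part of `datasets`.
    """
    for path in paths:
        name = os.path.basename(path)
        with open(path, encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue  # skip blank lines between records
                example = json.loads(line)
                example[column] = name  # extra feature: originating file
                yield example
```

With the requested parameter, every example returned by load_dataset would carry an equivalent extra field.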

Motivation

When training large language models, hardware or system failures may interrupt the training process. In such cases, it is common to load a previously saved checkpoint to resume training. I would like to be able to obtain the names of the data shards that were already used for training, so that I can skip them when training continues, avoiding overfitting and redundant training time.
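The resume logic I would build on top of the file names could look roughly like the sketch below. It assumes examples are read shard by shard, in order and without shuffling, so that a change of file name means the previous shard is finished. Both function names are hypothetical.

```python
from typing import Iterable, List, Set


def finished_shards(seen_file_names: Iterable[str]) -> Set[str]:
    """Shards fully consumed, assuming examples arrive shard by shard in order.

    The last shard seen may have been read only partially when training
    stopped, so it is not counted as finished.
    """
    ordered: List[str] = []
    for name in seen_file_names:
        if not ordered or ordered[-1] != name:
            ordered.append(name)
    return set(ordered[:-1])


def remaining_shards(all_shards: Iterable[str], finished: Set[str]) -> List[str]:
    """Return the shards not yet consumed, preserving the original order."""
    return [s for s in all_shards if s not in finished]
```

The set of finished shards would be saved alongside the checkpoint, and only the remaining shards would be passed as data files when training resumes.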

Your contribution

I currently use a dataset in jsonl format, so I am primarily interested in the json format. I suggest adding the file name to the returned table here https://github.com/huggingface/datasets/blob/main/src/datasets/packaged_modules/json/json.py#L92.
