huggingface/datasets

Return the name of the currently loaded file in the load_dataset function.

Open

#5806 opened on Apr 28, 2023

Labels: enhancement, good first issue

Description

Feature request

Add an optional parameter return_file_name to the load_dataset function. When it is set to True, each returned example includes, as an additional feature, the name of the file that the example was read from.
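To illustrate the requested behavior, here is a minimal standard-library sketch. It does not use load_dataset, and return_file_name does not exist in the library today; the helper iter_jsonl_with_file_name and the column name file_name are hypothetical names chosen for this example only.

```python
import json
import os
from typing import Iterable, Iterator


def iter_jsonl_with_file_name(
    paths: Iterable[str], column: str = "file_name"
) -> Iterator[dict]:
    """Yield each JSONL record with the name of its source file attached.

    Hypothetical helper: it mimics what return_file_name=True is asked to do.
    """
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue  # skip blank lines between records
                record = json.loads(line)
                # Attach the source file as an extra feature on the record.
                record[column] = os.path.basename(path)
                yield record
```

Each yielded dictionary keeps its original keys and gains one extra key holding the base name of the shard it came from.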

Motivation

When training large language models, hardware failures can interrupt the training process. In such cases, it is common to load a previously saved checkpoint and resume training. I would like to be able to obtain the names of the data shards that have already been used for training, so that I can skip them when training resumes. This avoids overfitting on repeated data and wasted training time.
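The resume step described above could look roughly like the following sketch, assuming the training loop records the name of every shard it has finished. The function remaining_shards and the shard names are hypothetical and only serve to show the idea.

```python
from typing import Iterable, List


def remaining_shards(all_shards: Iterable[str], completed: Iterable[str]) -> List[str]:
    """Return the shards that still need training, preserving their order."""
    done = set(completed)  # set lookup keeps the filter linear in shard count
    return [shard for shard in all_shards if shard not in done]


# Hypothetical state restored from a checkpoint after an interruption.
all_shards = ["part-000.jsonl", "part-001.jsonl", "part-002.jsonl"]
completed = ["part-000.jsonl"]

# Only these files would be passed to the data loader on resume.
to_load = remaining_shards(all_shards, completed)
```

This works at whole-shard granularity only; a shard that was partially consumed before the interruption would be replayed from its start.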

Your contribution

I currently use a dataset in JSONL format, so I am primarily interested in the JSON loader. I suggest adding the file name to the returned table here: https://github.com/huggingface/datasets/blob/main/src/datasets/packaged_modules/json/json.py#L92.
