Return the name of the currently loaded file in the load_dataset function. · huggingface/datasets#5806

(19 comments) (2 reactions) (1 assignee) Python (2,496 forks)

enhancement, good first issue

Repository metrics

Stars: 18,313
PR merge metrics: average time to merge 25 d 5 h; 21 PRs merged in the last 30 days

Description

Feature request

Add an optional parameter `return_file_name` to the `load_dataset` function. When it is set to `True`, the returned dataset includes, as an extra feature, the name of the file each row was read from.
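
To make the request concrete, here is a minimal pure-Python sketch of the behaviour being asked for. It does not use or modify `datasets`; the helper `iter_jsonl_with_file_name` and the column name `file_name` are hypothetical, chosen only for illustration, and each JSONL line is assumed to hold a JSON object.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, Iterator


def iter_jsonl_with_file_name(
    paths: Iterable[str], column: str = "file_name"
) -> Iterator[Dict]:
    """Yield one dict per JSONL line, tagged with the file it came from.

    Hypothetical helper: it only mimics what `return_file_name=True`
    might produce; it is not part of the `datasets` API.
    """
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:  # skip blank lines
                    continue
                row = json.loads(line)  # assumed to be a JSON object
                row[column] = Path(path).name  # record the source file
                yield row
```

Each yielded row carries its source file name, which is the information the feature request wants `load_dataset` to expose.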

Motivation

When training large language models, hardware or machine failures may interrupt the training process. In such cases, it is common to load a previously saved checkpoint to resume training. I would like to be able to obtain the names of the data shards that have already been trained on, so that I can skip them when training continues and avoid both overfitting and redundant training time.
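
The resume workflow described here could look roughly like the following sketch. It assumes the training loop records the names of fully consumed shards in its checkpoint; `remaining_shards` and the shard names are hypothetical and not part of `datasets`.

```python
from pathlib import Path
from typing import Iterable, List, Set


def remaining_shards(all_files: Iterable[str], consumed: Set[str]) -> List[str]:
    """Return the shard paths whose file name is not in `consumed`, keeping order."""
    return [f for f in all_files if Path(f).name not in consumed]


# Shard names recovered from a checkpoint (illustrative values only).
consumed = {"shard-000.jsonl", "shard-001.jsonl"}
all_files = [
    "data/shard-000.jsonl",
    "data/shard-001.jsonl",
    "data/shard-002.jsonl",
]
todo = remaining_shards(all_files, consumed)
print(todo)  # ['data/shard-002.jsonl']
```

The filtered list could then be passed on, for example as `load_dataset("json", data_files=todo)`, so that only unseen shards are read after a restart.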

Your contribution

I currently use a dataset in jsonl format, so I am primarily interested in the json format. I suggest adding the file name to the returned table here https://github.com/huggingface/datasets/blob/main/src/datasets/packaged_modules/json/json.py#L92.

Contributor guide

Research direction: The issue asks for an optional `return_file_name` parameter on the `load_dataset` function that includes the source file name for each row. The implementation should modify the JSON parsing module at `src/datasets/packaged_modules/json/json.py#L92`. Ensure backward compatibility and handle the different input formats. Check the existing comments for any design discussion or preferred approaches.
Tech stack: Python
Domain: machine learning
Issue type: Feature
Difficulty: 2
Estimated time: 1-3 hours
Activity status: Active
Clarity: Clear
Prerequisites: Python, Git
Beginner accessibility: 65