Return the name of the currently loaded file in the load_dataset function. · huggingface/datasets#5806

(19 comments) (2 reactions) (1 assignee) Python (2,496 forks)

enhancement, good first issue

Repository metrics

Stars: 18,313
PR merge metrics: average merge time 25d 5h; 21 PRs merged in 30 days

Description

Feature request

Add an optional parameter return_file_name in the load_dataset function. When it is set to True, the function will include the name of the file corresponding to the current line as a feature in the returned output.
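As a rough illustration of the requested behavior, here is a standalone sketch that uses only the standard library, not the `datasets` implementation. It reads JSON Lines files and optionally tags each record with its source file. The helper name `iter_jsonl` and the `file_name` key are hypothetical; `return_file_name` mirrors the parameter proposed above.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, Iterator


def iter_jsonl(paths: Iterable[str], return_file_name: bool = False) -> Iterator[Dict]:
    """Yield one dict per non-empty line of each JSON Lines file.

    Hypothetical helper: `return_file_name` mirrors the proposed parameter.
    """
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue  # skip blank lines
                record = json.loads(line)
                if return_file_name:
                    # "file_name" is an assumed column name, not an existing feature
                    record["file_name"] = Path(path).name
                yield record
```

With `return_file_name=False` the records are yielded unchanged, which is how an opt-in flag would keep existing callers unaffected.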

Motivation

When training large language models, hardware or machine failures may interrupt the training process. In such cases, it is common to load a previously saved checkpoint to resume training. I would like to be able to obtain the names of the data shards that have already been trained on, so that I can skip them when training resumes and avoid both overfitting and redundant training time.
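The resume step could look something like the sketch below, assuming the set of already-consumed shard names was saved alongside the checkpoint. The function name, the shard names, and the way the set is stored are all hypothetical.

```python
from typing import Iterable, List, Set


def remaining_shards(all_shards: Iterable[str], consumed: Set[str]) -> List[str]:
    """Return the shards not yet consumed, preserving the original order."""
    return [s for s in all_shards if s not in consumed]


# Hypothetical state restored from a checkpoint
consumed = {"part-000.jsonl", "part-001.jsonl"}
shards = ["part-000.jsonl", "part-001.jsonl", "part-002.jsonl"]
todo = remaining_shards(shards, consumed)
```

Only the shards in `todo` would then be passed to the data loader when training continues.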

Your contribution

I currently use a dataset in jsonl format, so I am primarily interested in the json format. I suggest adding the file name to the returned table here https://github.com/huggingface/datasets/blob/main/src/datasets/packaged_modules/json/json.py#L92.

Contributor guide

Research direction: The issue asks for an optional `return_file_name` parameter on the `load_dataset` function that includes the file name for each row. The implementation should modify the JSON parsing module at `src/datasets/packaged_modules/json/json.py#L92`. Ensure backward compatibility and handle different input formats. Check the existing comments for design discussions or preferred approaches.
Tech stack: Python
Domain: machine learning
Issue type: Feature
Difficulty: 2
Estimated time: 1-3 hours
Task status: Active
Clarity: Clear
Prerequisites: Python, Git
Beginner-friendly: 65