Return the name of the currently loaded file in the load_dataset function. · huggingface/datasets#5806

19 comments · 2 reactions · 1 assignee · Python · 2,496 forks

enhancement, good first issue

Repository metrics

Stars: 18,313
PR merge metrics: average merge time 25d 5h; 21 PRs merged in 30 days

Description

Feature request

Add an optional parameter return_file_name in the load_dataset function. When it is set to True, the function will include the name of the file corresponding to the current line as a feature in the returned output.
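To make the request concrete, here is a minimal sketch of the behavior being asked for, written in plain Python rather than against the `datasets` internals. The helper `iter_jsonl_with_file_name` and the column name `file_name` are hypothetical and chosen only for illustration; the issue does not specify how the feature would be named in the output.

```python
import json
import os
from typing import Any, Dict, Iterable, Iterator


def iter_jsonl_with_file_name(
    paths: Iterable[str], return_file_name: bool = False
) -> Iterator[Dict[str, Any]]:
    """Yield one dict per JSON Lines record, optionally tagged with its source file.

    Hypothetical helper: it only mimics the requested `return_file_name` flag
    and is not part of the `datasets` library.
    """
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                record = json.loads(line)
                if return_file_name:
                    # "file_name" is an assumed column name
                    record["file_name"] = os.path.basename(path)
                yield record
```

With `return_file_name=False` (the default in this sketch) the records are unchanged, which is the backward-compatible behavior an optional parameter would need.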

Motivation

When training large language models, hardware or system failures may interrupt the training process. In such cases, it is common to load a previously saved checkpoint and resume training. I would like to be able to obtain the names of the data shards that have already been used for training, so that I can skip them when training continues. This avoids overfitting on repeated data and wasted training time.
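The resume workflow described above could look roughly like the following sketch. It assumes the set of already-consumed shard names was stored alongside the checkpoint; the function `remaining_shards` and the shard names are hypothetical.

```python
from typing import Iterable, List, Set


def remaining_shards(all_shards: Iterable[str], seen_shards: Set[str]) -> List[str]:
    """Return the shards not yet consumed, preserving the original order."""
    return [shard for shard in all_shards if shard not in seen_shards]


# Hypothetical shard names; `seen_shards` would be restored from a checkpoint.
all_shards = ["shard-000.jsonl", "shard-001.jsonl", "shard-002.jsonl"]
seen_shards = {"shard-000.jsonl", "shard-001.jsonl"}

print(remaining_shards(all_shards, seen_shards))  # ['shard-002.jsonl']
```

Only the remaining file names would then be passed on as the data files for the resumed run.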

Your contribution

I currently use a dataset in jsonl format, so I am primarily interested in the json format. I suggest adding the file name to the returned table here https://github.com/huggingface/datasets/blob/main/src/datasets/packaged_modules/json/json.py#L92.

Contributor Guide

Research direction: The issue asks for an optional parameter `return_file_name` on the `load_dataset` function that includes the file name for each row. The implementation should modify the JSON parsing module at `src/datasets/packaged_modules/json/json.py#L92`. Ensure backward compatibility and handle different input formats. Check the existing comments for design discussions or preferred approaches.
Tech Stack: python
Domain: machine learning
Issue Type: Feature
Difficulty: 2
Estimated time: 1-3 hours
Activity status: Active
Clarity: Clear
Prerequisites: Python, Git
Beginner friendliness: 65