Return the name of the currently loaded file in the load_dataset function. · huggingface/datasets#5806

(19 comments) (2 reactions) (1 assignee) Python (2,496 forks)

enhancement, good first issue

Repository metrics

Stars: 18,313
PR merge metrics: (average time to merge: 25 days 5 hours) (21 PRs merged in the last 30 days)

Description

Feature request

Add an optional parameter return_file_name in the load_dataset function. When it is set to True, the function will include the name of the file corresponding to the current line as a feature in the returned output.
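
To make the requested behavior concrete, here is a minimal sketch of the output shape using only the standard library. It does not use or modify `load_dataset`; the helper name `iter_jsonl_with_file_name` and the column name `file_name` are assumptions made for illustration, not part of the datasets library.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def iter_jsonl_with_file_name(
    paths: Iterable[str], column: str = "file_name"
) -> Iterator[dict]:
    """Yield each JSON Lines record with the name of its source file attached.

    Hypothetical helper for illustration; not part of the datasets library.
    """
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                if not line.strip():
                    continue  # skip blank lines between records
                record = json.loads(line)
                # Attach the source file name as an extra feature on the row.
                record[column] = Path(path).name
                yield record
```

Each yielded row then carries its shard name, which is the information the proposed `return_file_name=True` option would expose.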

Motivation

When training large language models, machine problems may interrupt the training process. In such cases, it is common to load a previously saved checkpoint to resume training. I would like to be able to obtain the names of the previously trained data shards, so that I can skip these parts of the data during continued training to avoid overfitting and redundant training time.

Your contribution

I currently use a dataset in jsonl format, so I am primarily interested in the json format. I suggest adding the file name to the returned table here https://github.com/huggingface/datasets/blob/main/src/datasets/packaged_modules/json/json.py#L92.

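Contributor guide

Research direction: This issue asks for an optional `return_file_name` parameter in the `load_dataset` function so that each row includes the name of its source file. The implementation should modify the JSON parsing module at `src/datasets/packaged_modules/json/json.py#L92`. Ensure backward compatibility and handle the different input formats. Check the existing comments for any design discussion or preferred approach.
Tech stack: Python
Domain: machine learning
Issue type: feature
Difficulty: 2
Estimated time: 1-3 hours
Activity: active
Clarity: clear
Prerequisites: Python, Git
Beginner friendliness: 65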
