Invalid suffix of raw dataset when preprocessing without language · facebookresearch/fairseq#1426

(2 comments) (0 reactions) (0 assignees) Python (6,224 forks)

bug, help wanted

Repository metrics

Stars: 29,107
PR merge metrics: (no PRs merged in the last 30 days)

Description

When preprocessing with --dataset-impl raw and neither a source nor a target language is specified, the datasets are stored under train.None-None because of this line:

https://github.com/pytorch/fairseq/blob/5349052aae4ec1350822c894fbb6be350dff61a0/preprocess.py#L218

Is this expected behavior or can we remove this suffix?
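
For illustration, here is a minimal sketch of how a suffix like this can arise. It assumes the linked line builds the output name by formatting the source and target language options into the prefix without checking for None. The function names below are hypothetical and are not taken from preprocess.py.

```python
def raw_output_name(output_prefix, source_lang=None, target_lang=None):
    """Hypothetical stand-in for the naming step: formats both language
    options into the name unconditionally, so unset (None) values leak in."""
    return "{}.{}-{}".format(output_prefix, source_lang, target_lang)


def raw_output_name_guarded(output_prefix, source_lang=None, target_lang=None):
    """Possible alternative (an assumption, not the project's fix):
    append the language-pair suffix only when both languages are set."""
    if source_lang is None or target_lang is None:
        return output_prefix
    return "{}.{}-{}".format(output_prefix, source_lang, target_lang)


if __name__ == "__main__":
    print(raw_output_name("train"))                      # train.None-None
    print(raw_output_name_guarded("train"))              # train
    print(raw_output_name_guarded("train", "de", "en"))  # train.de-en
```

Whether dropping the suffix is safe depends on what file names the dataset loading code looks for, which this sketch does not cover.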

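Contributor guide

Research direction: Inspect the preprocessing code at line 218 of preprocess.py and determine whether the language suffix should be dropped when no languages are specified. Consider whether this suffix affects dataset loading and whether removing it is safe.
Tech stack: python
Domain: backend
Issue type: bug
Difficulty: 2
Estimated time: 1-3 hours
Activity status: active
Clarity: clear
Prerequisites: GitPython
Beginner friendliness: 75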
