Invalid suffix of raw dataset when preprocessing without language · facebookresearch/fairseq#1426

2 comments · 0 reactions · 0 assignees · Python · 6,224 forks

Labels: bug, help wanted

Repository metrics

Stars: 29,107
PR merge metrics: no merged PRs in the last 30 days

Description

When preprocessing with --dataset-impl raw and without specifying source and target languages, the datasets are stored under train.None-None because of this line:

https://github.com/pytorch/fairseq/blob/5349052aae4ec1350822c894fbb6be350dff61a0/preprocess.py#L218

Is this expected behavior or can we remove this suffix?
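
To illustrate the symptom, here is a minimal sketch that assumes the linked line builds the output name by formatting the source and target language arguments into a `.{}-{}` suffix. That assumption matches the reported `train.None-None` name but is not copied from the repository. The helper `raw_output_prefix` is hypothetical and not part of fairseq. It only shows one way the suffix could be skipped when either language is missing.

```python
def raw_output_prefix(output_prefix, source_lang=None, target_lang=None):
    """Hypothetical helper (not fairseq API): build the raw dataset prefix.

    Appends ".<src>-<tgt>" only when both languages are given, so a
    monolingual run keeps the bare prefix instead of "None-None".
    """
    if source_lang is None or target_lang is None:
        return output_prefix
    return "{}.{}-{}".format(output_prefix, source_lang, target_lang)


# Formatting None values directly reproduces the name from the report.
reported = "train" + ".{}-{}".format(None, None)
print(reported)  # train.None-None

# With the guard, the suffix is omitted when no languages are specified.
print(raw_output_prefix("train"))  # train

# With both languages set, the usual pair suffix is kept.
print(raw_output_prefix("train", "de", "en"))  # train.de-en
```

Whether dropping the suffix is safe depends on how the dataset loader resolves file names for raw datasets, which this sketch does not check.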

Contributor guide

Research direction: Inspect the preprocessing code at line 218 of preprocess.py to determine if the language suffix should be removed when no languages are specified. Consider whether this suffix affects dataset loading and if it's safe to remove.
Tech stack: python
Domain: backend
Issue type: Bug
Difficulty: 2
Estimated time: 1-3 hours
Activity status: Active
Clarity: Clear
Prerequisites: GitPython
Newbie friendliness: 75
