facebookresearch/metaseq

Unify tokenizers

Open

#308 opened on Aug 20, 2022

better-eng, enhancement, good first issue

Description

🚀 Feature Request

With #305, we now have two ways to specify a tokenizer: the GPT2 tokenizer (provided as two files) and the universal HF format (provided as one file). These live in two separate code paths, but they don't need to: we could (manually) merge the two GPT2 files into the universal HF format and support only that format, and we should.
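
As a rough illustration of the manual merge, here is a minimal sketch that folds a GPT2-style `vocab.json` and `merges.txt` into one JSON file. The function name `merge_gpt2_files`, the file names, and the output layout (a single `model` section, loosely modeled on the HF single-file format) are assumptions for illustration, not metaseq code. A real migration would likely let the HF `tokenizers` library write the file, and the result should be checked by loading it through the existing HF code path.

```python
import json
from pathlib import Path


def merge_gpt2_files(vocab_path, merges_path, out_path):
    """Fold GPT2 vocab.json + merges.txt into one JSON file (illustrative sketch)."""
    # vocab.json maps each token string to its integer id.
    vocab = json.loads(Path(vocab_path).read_text(encoding="utf-8"))

    # merges.txt lists one "left right" merge rule per line, in priority order.
    lines = Path(merges_path).read_text(encoding="utf-8").split("\n")
    # The first line is usually a "#version: ..." header rather than a rule.
    if lines and lines[0].startswith("#version"):
        lines = lines[1:]
    merges = [line for line in lines if line]

    # Assumed layout: only the BPE model section. A complete HF file also
    # records the pre-tokenizer, decoder, and special tokens.
    merged = {"model": {"type": "BPE", "vocab": vocab, "merges": merges}}

    Path(out_path).write_text(
        json.dumps(merged, ensure_ascii=False), encoding="utf-8"
    )
    return merged
```

Calling `merge_gpt2_files("vocab.json", "merges.txt", "tokenizer.json")` (hypothetical paths) writes the combined file and returns the same structure as a dict, which makes it easy to compare vocabulary sizes and merge counts against the originals before switching code paths.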

The resulting code would be cleaner, but catching every other place the old method is used (e.g. the API and old sweeps) will need thorough review.
