facebookresearch/metaseq

Unify tokenizers

Open

#308 opened on August 20, 2022

Labels: better-eng, enhancement, good first issue

Description

🚀 Feature Request

With #305, we now have two ways to specify a tokenizer: the GPT2 tokenizer (provided as two files) and the universal HF format (specified as one file). These live in two separate code paths, but they don't need to: we could (manually) merge the two GPT2 files into the universal HF format and switch to only that. We should.
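For illustration, here is a minimal sketch of what the manual merge could look like, assuming the Hugging Face `tokenizers` package is installed and the two GPT2 files are the usual vocab JSON and merges text file. The helper name `merge_gpt2_tokenizer_files` and the paths are hypothetical and not part of metaseq.

```python
from tokenizers import ByteLevelBPETokenizer, Tokenizer


def merge_gpt2_tokenizer_files(vocab_path: str, merges_path: str, out_path: str) -> Tokenizer:
    """Hypothetical helper: combine the two GPT2 files into one HF tokenizer file."""
    # Load the legacy two-file GPT2 byte-level BPE tokenizer.
    legacy = ByteLevelBPETokenizer(vocab_path, merges_path)
    # Write vocab, merges, and pre-tokenizer/decoder settings to a single JSON file.
    legacy.save(out_path)
    # Reload from that single file to confirm it is self-contained.
    return Tokenizer.from_file(out_path)
```

Before switching over, it would be worth checking that the reloaded tokenizer produces the same ids as the two-file version on a sample of real text.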

The resulting code would be cleaner, but catching all the other places the old method is used (e.g. API and old sweeps) needs thorough review.
