facebookresearch/metaseq

Unify tokenizers

Open

#308 opened on August 20, 2022

(1 comment) (1 reaction) (1 assignee) Python (6,195 stars) (701 forks) batch import
better-eng, enhancement, good first issue

Description

🚀 Feature Request

With #305, we now have two ways to specify a tokenizer: with the GPT2 tokenizer (provided as two files), and with the universal HF format (specified as one file). These are in two separate code paths, but they don't need to be: we could (manually) merge the two GPT2 files into the universal HF format and switch to only that, and we should.

The resulting code would be cleaner, but catching all the other places the old method is used (e.g. API and old sweeps) needs thorough review.
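A minimal sketch of what the manual merge could look like, assuming the Hugging Face `tokenizers` package and GPT-2 style `vocab.json` / `merges.txt` inputs. The function names and the ByteLevel settings are illustrative assumptions, not metaseq's actual code, and the output should be checked against the existing GPT2 tokenizer before anything is switched over.

```python
import json
from pathlib import Path


def read_gpt2_files(vocab_path, merges_path):
    """Parse the two GPT-2 files into a vocab dict and a list of merge pairs.

    Hypothetical helper for illustration; it is not part of metaseq.
    """
    vocab = json.loads(Path(vocab_path).read_text(encoding="utf-8"))
    merges = []
    for line in Path(merges_path).read_text(encoding="utf-8").split("\n"):
        # Skip blank lines and the "#version" header that merges files usually start with.
        if not line.strip() or line.startswith("#version"):
            continue
        left, right = line.split(" ")
        merges.append((left, right))
    return vocab, merges


def save_unified(vocab, merges, out_path):
    """Write a single-file tokenizer in the Hugging Face JSON format.

    Hypothetical helper; the ByteLevel components below are assumed to match
    GPT-2 behaviour and should be verified against the old code path.
    """
    # Imported lazily so the parsing helper above works without the dependency.
    from tokenizers import Tokenizer, decoders, models, pre_tokenizers, processors

    tokenizer = Tokenizer(models.BPE(vocab=vocab, merges=merges))
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    tokenizer.decoder = decoders.ByteLevel()
    tokenizer.post_processor = processors.ByteLevel(trim_offsets=False)
    tokenizer.save(str(out_path))
```

With placeholder file names, the conversion would be `save_unified(*read_gpt2_files("gpt2-vocab.json", "gpt2-merges.txt"), "gpt2-unified.json")`, and the result can be loaded with `Tokenizer.from_file("gpt2-unified.json")`. Encoding a sample of text with both the old and the new tokenizer and comparing the token IDs would be a reasonable sanity check before removing the two-file code path.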
