deepjavalibrary/djl
Divergence between BertTokenizerFull and HuggingFace BertTokenizer.
Open
#2,212 opened on Dec 6, 2022
Call for Contribution, bug, good first issue
Description
BertFullTokenizer (more precisely, WordPieceTokenizer.java) in DJL does not handle subword cases that are frequent in languages like German. The problem is not present in DJL's HuggingFaceTokenizer, but that tokenizer cannot be used on Android (https://github.com/deepjavalibrary/djl/issues/2170).
Expected Behavior
For a word that consists of subwords (e.g. "wochenendtagenwecker"), and assuming that "wecker" (without ##) is present in the vocab, "wochenendtagenwecker" should be tokenized as {"wo", "##chen", "##end", "##tagen", "wecker"}. At least, this is the standard expected behavior in modern BERT tokenizer implementations (both BertTokenizer and BertTokenizerFast in HuggingFace).
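To make the expected output concrete, here is a minimal sketch of a greedy longest-match split that produces it. This is neither DJL's WordPieceTokenizer nor HuggingFace's code: the class name `SubwordFallbackSketch`, the method `tokenizeWord`, and in particular the rule "inside a word, fall back to a vocab entry without ## when no ## entry matches" are assumptions inferred from the expected tokens above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

// Hypothetical sketch for illustration; not DJL's or HuggingFace's implementation.
final class SubwordFallbackSketch {

    /** Greedy longest-match-first split of a single word; returns [unk] if no split exists. */
    static List<String> tokenizeWord(String word, Set<String> vocab, String unk) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            String match = null;
            int end = word.length();
            // Try the longest remaining substring first, then shrink it.
            for (; end > start; end--) {
                String sub = word.substring(start, end);
                String continuation = "##" + sub;
                if (start > 0 && vocab.contains(continuation)) {
                    match = continuation; // usual WordPiece continuation piece
                    break;
                }
                if (vocab.contains(sub)) {
                    // Word start, or the ASSUMED fallback to a standalone entry mid-word.
                    match = sub;
                    break;
                }
            }
            if (match == null) {
                return Arrays.asList(unk); // no piece fits at this position
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }
}
```

With the vocab from the reproduction snippet, `tokenizeWord("wochenendtagenwecker", vocab, "[UNK]")` yields the five expected pieces, and "radiowecker" stays a single token because the full word is tried first.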
How to Reproduce?
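import ai.djl.modality.nlp.DefaultVocabulary;
import ai.djl.modality.nlp.bert.BertFullTokenizer;
import java.util.Arrays;
import java.util.List;

public class Repro {
    public static void main(String[] args) {
        List<String> vocab = Arrays.asList("wo", "##chen", "##end", "##tagen", "wecker", "radiowecker");
        DefaultVocabulary vocabulary = new DefaultVocabulary(vocab);
        // The second argument enables lowercasing of the input.
        BertFullTokenizer tokenizer = new BertFullTokenizer(vocabulary, true);
        String a = "wochenendtagenwecker radiowecker";
        List<String> expected = Arrays.asList("wo", "##chen", "##end", "##tagen", "wecker", "radiowecker");
        List<String> tokenW = tokenizer.tokenize(a);
        // Print both lists so the divergence is visible.
        System.out.println("expected: " + expected + ", actual: " + tokenW);
    }
}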
Steps to reproduce
Add DJL as a Gradle dependency and run the snippet above ;)
What have you tried to solve it?
- From what I have read, fixing this probably requires a trie or Aho-Corasick matcher in WordPieceTokenizer.
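As a rough illustration of the trie idea, the sketch below stores vocab entries in a character trie and returns the longest entry that starts at a given offset. The class `VocabTrieSketch` and its methods are hypothetical names, not part of DJL, and the sketch covers only the lookup, not a full tokenizer.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of DJL.
final class VocabTrieSketch {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal; // true if a vocab entry ends at this node
    }

    private final Node root = new Node();

    /** Inserts one vocab entry, one character per trie level. */
    void add(String token) {
        Node node = root;
        for (int i = 0; i < token.length(); i++) {
            node = node.children.computeIfAbsent(token.charAt(i), c -> new Node());
        }
        node.terminal = true;
    }

    /** Returns the exclusive end index of the longest entry starting at start, or -1 if none. */
    int longestMatchEnd(String text, int start) {
        Node node = root;
        int best = -1;
        for (int i = start; i < text.length(); i++) {
            node = node.children.get(text.charAt(i));
            if (node == null) {
                break; // no entry continues with this character
            }
            if (node.terminal) {
                best = i + 1; // remember the longest entry seen so far
            }
        }
        return best;
    }
}
```

One possible design, again an assumption, is to keep two such tries: one for standalone entries and one for ## entries with the prefix stripped, and to query the second one first when the offset is inside a word. Each lookup then costs at most one pass over the remaining characters instead of one hash lookup per candidate substring.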
Environment Info
The bug is not related to the environment.