deepjavalibrary/djl

Divergence between BertFullTokenizer and HuggingFace BertTokenizer

Open

#2212 opened on Dec 6, 2022

Labels: Call for Contribution, bug, good first issue

Description

BertFullTokenizer (more precisely, WordPieceTokenizer.java) in DJL does not handle subword cases that are frequent in languages such as German. The problem is not present in DJL's HuggingFaceTokenizer, but that tokenizer cannot be used on Android (https://github.com/deepjavalibrary/djl/issues/2170).

Expected Behavior

For a word that consists of subwords (e.g. "wochenendtagenwecker"), and assuming that "wecker" (without the ## prefix) is present in the vocabulary, "wochenendtagenwecker" should be tokenized as {"wo", "##chen", "##end", "##tagen", "wecker"}. At least, this is the standard expected behavior in modern implementations of the BERT tokenizer (both BertTokenizer and BertTokenizerFast in HuggingFace).
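To make the expected output concrete, here is a minimal sketch of a greedy longest-match split that produces the tokens listed above. It is only an illustration of the behavior this issue asks for: the class `ExpectedWordPieceSketch` and the method `tokenizeWord` are hypothetical names, the fallback to a bare (non-##) vocabulary entry in the middle of a word is an assumption taken from the description above, and none of this is code from DJL or HuggingFace.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration class; not part of DJL or HuggingFace.
public class ExpectedWordPieceSketch {

    /** Splits one word by greedy longest match, allowing bare vocabulary entries mid-word. */
    public static List<String> tokenizeWord(String word, Set<String> vocab, String unknown) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            String match = null;
            int end = word.length();
            // Try the longest remaining substring first and shrink from the right.
            for (; end > start; end--) {
                String sub = word.substring(start, end);
                if (start > 0 && vocab.contains("##" + sub)) {
                    match = "##" + sub; // regular continuation piece
                    break;
                }
                if (vocab.contains(sub)) {
                    match = sub; // word start, or assumed bare-entry fallback mid-word
                    break;
                }
            }
            if (match == null) {
                // No piece fits at this position: treat the whole word as unknown.
                return new ArrayList<>(Arrays.asList(unknown));
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }

    public static void main(String[] args) {
        Set<String> vocab = new HashSet<>(
                Arrays.asList("wo", "##chen", "##end", "##tagen", "wecker", "radiowecker"));
        for (String word : "wochenendtagenwecker radiowecker".split(" ")) {
            System.out.println(tokenizeWord(word, vocab, "[UNK]"));
        }
    }
}
```

With the six-entry vocabulary from this issue, the sketch prints `[wo, ##chen, ##end, ##tagen, wecker]` and then `[radiowecker]`. The repeated `substring` and set lookups make it quadratic in the word length, which is acceptable for a demonstration but not for a production tokenizer.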

How to Reproduce?

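import ai.djl.modality.nlp.DefaultVocabulary;
import ai.djl.modality.nlp.bert.BertFullTokenizer;
import java.util.Arrays;
import java.util.List;

public class WordPieceRepro {
    public static void main(String[] args) {
        // "wecker" is in the vocabulary only without the "##" continuation prefix.
        List<String> vocab = Arrays.asList("wo", "##chen", "##end", "##tagen", "wecker", "radiowecker");
        DefaultVocabulary vocabulary = new DefaultVocabulary(vocab);
        BertFullTokenizer tokenizer = new BertFullTokenizer(vocabulary, true);

        String a = "wochenendtagenwecker radiowecker";
        String[] expected = {"wo", "##chen", "##end", "##tagen", "wecker", "radiowecker"};
        List<String> tokenW = tokenizer.tokenize(a);

        // Print both so the divergence is visible.
        System.out.println("expected: " + Arrays.toString(expected));
        System.out.println("actual:   " + tokenW);
    }
}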

Steps to reproduce

Add DJL as a Gradle dependency and run the snippet above.

What have you tried to solve it?

  1. From what I have read, a trie or Aho-Corasick matcher in WordPieceTokenizer is probably needed to fix this.
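As a rough idea of what the trie option could look like, the sketch below stores vocabulary entries in a character trie and finds the longest entry starting at a given position in a single left-to-right walk. Everything here is hypothetical: `VocabTrieSketch`, `insert`, and `longestMatch` are illustrative names, the entries are stored without the ## prefix to keep the example short, and this is not the existing WordPieceTokenizer code or a proposed patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not the WordPieceTokenizer code in DJL.
public class VocabTrieSketch {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal; // true if the path from the root spells a vocabulary entry
    }

    private final Node root = new Node();

    /** Adds one vocabulary entry, character by character. */
    public void insert(String token) {
        Node node = root;
        for (char c : token.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.terminal = true;
    }

    /** Returns the length of the longest entry starting at text[start], or 0 if none does. */
    public int longestMatch(String text, int start) {
        Node node = root;
        int best = 0;
        for (int i = start; i < text.length(); i++) {
            node = node.children.get(text.charAt(i));
            if (node == null) {
                break; // no entry continues with this character
            }
            if (node.terminal) {
                best = i - start + 1; // remember the longest entry seen so far
            }
        }
        return best;
    }

    public static void main(String[] args) {
        VocabTrieSketch trie = new VocabTrieSketch();
        // Assumption: "##" prefixes are stripped before insertion for brevity.
        for (String t : new String[] {"wo", "chen", "end", "tagen", "wecker", "radiowecker"}) {
            trie.insert(t);
        }
        String word = "wochenendtagenwecker";
        System.out.println(trie.longestMatch(word, 0));  // length of "wo"
        System.out.println(trie.longestMatch(word, 14)); // length of "wecker"
    }
}
```

A real fix would still need to distinguish word-initial entries from ## continuation entries, for example with two tries or a flag on each terminal node. An Aho-Corasick automaton would go one step further and find all matches in one pass over the word.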

Environment Info

The bug is not related to the environment.
