Divergence between BertTokenizerFull and HuggingFace BertTokenizer. · deepjavalibrary/djl#2212

Repository metrics

Stars: 3,706
PR merge metrics: no merged PRs in the last 30 days

Description

BertFullTokenizer (more precisely, WordPieceTokenizer.java) in DJL does not handle subword cases that are frequent in languages like German. The problem is not present in DJL's HuggingFaceTokenizer, but that tokenizer cannot be used on Android (https://github.com/deepjavalibrary/djl/issues/2170).

Expected Behavior

For a word that consists of subwords (e.g. "wochenendtagenwecker"), and assuming that "wecker" (without ##) is present in the vocabulary, "wochenendtagenwecker" should be tokenized as {"wo", "##chen", "##end", "##tagen", "wecker"}. At least, this is the standard expected behavior in modern implementations of the BERT tokenizer (both BertTokenizer and BertTokenizerFast in HuggingFace).
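The rule described above can be sketched as a greedy longest-match loop that prefers a "##"-prefixed piece inside a word and falls back to the un-prefixed token when only that form is in the vocabulary. This is a minimal illustration of the expected split, not code from DJL or HuggingFace; the class name `ExpectedSplitSketch`, the method `split`, and the use of "[UNK]" for unmatched words are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not DJL or HuggingFace code.
final class ExpectedSplitSketch {

    // Splits a single word into vocabulary pieces, longest match first.
    static List<String> split(String word, Set<String> vocab) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            String match = null;
            int end = word.length();
            // Shrink the candidate from the right until it is in the vocabulary.
            for (; end > start; end--) {
                String sub = word.substring(start, end);
                // Inside a word, prefer the "##"-prefixed continuation piece.
                if (start > 0 && vocab.contains("##" + sub)) {
                    match = "##" + sub;
                    break;
                }
                // Fallback (the behavior the issue asks for): accept the
                // un-prefixed token as well.
                if (vocab.contains(sub)) {
                    match = sub;
                    break;
                }
            }
            if (match == null) {
                // Assumption: an unmatched position makes the whole word unknown.
                return Collections.singletonList("[UNK]");
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }

    public static void main(String[] args) {
        Set<String> vocab = new HashSet<>(Arrays.asList(
                "wo", "##chen", "##end", "##tagen", "wecker", "radiowecker"));
        System.out.println(split("wochenendtagenwecker", vocab)); // [wo, ##chen, ##end, ##tagen, wecker]
        System.out.println(split("radiowecker", vocab));          // [radiowecker]
    }
}
```

The fallback branch is the only difference from a strict WordPiece loop: without it, the word would have no valid piece at the position of "wecker".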

How to Reproduce?

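import ai.djl.modality.nlp.DefaultVocabulary;
import ai.djl.modality.nlp.bert.BertFullTokenizer;
import java.util.Arrays;
import java.util.List;

public class TokenizerRepro {
  public static void main(String[] args) {
    // "wecker" is in the vocabulary only as a whole word, without the "##" prefix.
    List<String> vocab = Arrays.asList("wo", "##chen", "##end", "##tagen", "wecker", "radiowecker");
    DefaultVocabulary vocabulary = new DefaultVocabulary(vocab);
    BertFullTokenizer tokenizer = new BertFullTokenizer(vocabulary, true); // true = lower-case the input

    String a = "wochenendtagenwecker radiowecker";
    List<String> expected = Arrays.asList("wo", "##chen", "##end", "##tagen", "wecker", "radiowecker");
    List<String> tokenW = tokenizer.tokenize(a);
    // Print the actual tokens and whether they match the expected split.
    System.out.println(tokenW + " matches expected: " + expected.equals(tokenW));
  }
}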

Steps to reproduce

Add DJL as a Gradle dependency and run the snippet above ;)

What have you tried to solve it?

From what I have read, fixing this probably requires a trie or an Aho-Corasick matcher in WordPieceTokenizer.
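A minimal sketch of the trie idea, assuming one trie keyed on the piece text with the "##" stripped, where each node records whether the piece exists as a whole-word token, as a "##" continuation, or both. The class `TrieWordPieceSketch` and its members are hypothetical names for illustration; this is not DJL's WordPieceTokenizer and it does not cover Aho-Corasick.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a trie-backed subword matcher; not DJL code.
final class TrieWordPieceSketch {
    private static final String PREFIX = "##";

    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean plain;        // vocabulary contains the piece without "##"
        boolean continuation; // vocabulary contains "##" + piece
    }

    private final Node root = new Node();

    TrieWordPieceSketch(List<String> vocab) {
        for (String token : vocab) {
            boolean cont = token.startsWith(PREFIX);
            String piece = cont ? token.substring(PREFIX.length()) : token;
            if (piece.isEmpty()) {
                continue; // ignore empty entries and a bare "##"
            }
            Node node = root;
            for (char c : piece.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            if (cont) {
                node.continuation = true;
            } else {
                node.plain = true;
            }
        }
    }

    // Tokenizes one word; each position is resolved with a single trie walk.
    List<String> tokenize(String word) {
        List<String> out = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            Node node = root;
            int bestEnd = -1;
            boolean bestIsContinuation = false;
            for (int i = start; i < word.length(); i++) {
                node = node.next.get(word.charAt(i));
                if (node == null) {
                    break;
                }
                // Inside a word prefer "##" pieces, but also accept plain tokens.
                boolean contOk = start > 0 && node.continuation;
                if (contOk || node.plain) {
                    bestEnd = i + 1; // keep the longest match seen so far
                    bestIsContinuation = contOk;
                }
            }
            if (bestEnd < 0) {
                // Assumption: an unmatched position makes the whole word unknown.
                return Collections.singletonList("[UNK]");
            }
            String piece = word.substring(start, bestEnd);
            out.add(bestIsContinuation ? PREFIX + piece : piece);
            start = bestEnd;
        }
        return out;
    }

    public static void main(String[] args) {
        TrieWordPieceSketch t = new TrieWordPieceSketch(Arrays.asList(
                "wo", "##chen", "##end", "##tagen", "wecker", "radiowecker"));
        System.out.println(t.tokenize("wochenendtagenwecker")); // [wo, ##chen, ##end, ##tagen, wecker]
    }
}
```

Compared with testing every substring against a set, the trie walk stops as soon as no vocabulary entry can extend the current prefix, which is the main reason a trie is usually suggested for this kind of matching.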

Environment Info

The bug is not related to the environment.

Contributor guide

Research direction: Examine the WordPieceTokenizer.java implementation in DJL and compare its subword handling with the HuggingFace BertTokenizer. Focus on the logic for matching subwords without the '##' prefix, and consider implementing a trie-based approach as suggested.
Tech stack: java
Domain: backend, machine learning
Issue type: Bug
Difficulty: 3
Estimated time: Half day
Activity status: Active
Clarity: Clear
Prerequisites: Java, BERT tokenization concepts
Newbie friendliness: 65
