Duplicate tokens in BPE vocabulary · google/sentencepiece#881

help wanted

Description

Hi,

We're observing an "issue" with the sentencepiece tokenizer, where multiple tokens decode to identical strings.

We generated a vocabulary of size 32768 from the wikitext-103 dataset with the following code:

import sentencepiece as spm

spm.SentencePieceTrainer.Train(
    " --input=wikitext-103-raw/wiki.train.raw" +
    " --model_prefix=wiki_32768" +
    " --model_type=bpe" +
    " --vocab_size=32768" +
    " --hard_vocab_limit=True" +
    " --input_sentence_size=2000000" +
    " --unk_id=1" +
    " --bos_id=-1" +
    " --eos_id=-1" +
    " --pad_id=0" +
    " --max_sentencepiece_length=99" +
    " --split_by_unicode_script=True" +
    " --split_by_number=True" +
    " --split_by_whitespace=True" +
    " --add_dummy_prefix=False" +
    " --byte_fallback=True" +
    " --remove_extra_whitespaces=False" +
    " --allow_whitespace_only_pieces=True" +
    " --normalization_rule_name=identity" +
    " --user_defined_symbols=<D>" +
    " --split_digits=True" +
    " --vocabulary_output_piece_score=False")

After this we run the following inspection:

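sp = spm.SentencePieceProcessor(model_file="wiki_32768.model")  # model written by the training call above
for i in range(32768):
    print(i, bytes(sp.Decode([i]), 'utf-8'))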

Here we observe 2 "issues":

1. 80 pairs of tokens decode to the same single-character strings (see the full list below). Within each pair, one token has a low ID (near the start of the vocab) and the other a very high ID (near the end of the vocab). In the tokenized dataset, the tokenizer consistently uses only one token of each pair, usually the one with the higher ID.
2. 128 tokens decode to the same b'\xef\xbf\xbd' string, which is the UTF-8 encoding of the Unicode replacement character (U+FFFD). We don't understand why these would exist, since we use --split_by_unicode_script=True. Or are those tokens due to the --byte_fallback=True option?
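
A minimal sketch of how the first observation (only one token of each pair is ever emitted) could be quantified. The helper name `count_pair_usage` is hypothetical and not part of sentencepiece; it only counts IDs in data that has already been encoded.

```python
from collections import Counter

def count_pair_usage(encoded_sentences, pairs):
    """Count how often each ID of a duplicate pair occurs in encoded data.

    encoded_sentences: iterable of lists of token IDs.
    pairs: iterable of (low_id, high_id) tuples taken from the duplicate list.
    Returns {(low_id, high_id): (low_count, high_count)}.
    """
    counts = Counter()
    for ids in encoded_sentences:
        counts.update(ids)
    # Counter returns 0 for IDs that never occur.
    return {(lo, hi): (counts[lo], counts[hi]) for lo, hi in pairs}

# Toy data standing in for the tokenized dataset.
example = count_pair_usage(
    [[35, 32685, 32685], [32685, 68]],
    [(35, 32685), (68, 32714)],
)
print(example)
```

With a trained model, the encoded sentences could come from something like `(sp.EncodeAsIds(line) for line in open(path))`, where `path` is whatever text file is being inspected.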

Are both of these known and expected behaviors? To us it seems counterintuitive and wasteful that the vocab should use 2 token encodings to encode the same symbol.

Thanks!

List of duplicate tokens. In each line, b'X' is the string decoding (the result of the sp.Decode([i]) operation), and "A with [B]" means that the sp.Decode results for tokens A and B are identical.

Duplicate tokens (sp.Decode=b' ') 35 with [32685]
Duplicate tokens (sp.Decode=b'!') 36 with [32760]
Duplicate tokens (sp.Decode=b'"') 37 with [32710]
Duplicate tokens (sp.Decode=b'#') 38 with [32759]
Duplicate tokens (sp.Decode=b"'") 42 with [32711]
Duplicate tokens (sp.Decode=b'(') 43 with [32743]
Duplicate tokens (sp.Decode=b')') 44 with [32742]
Duplicate tokens (sp.Decode=b'*') 45 with [32725]
Duplicate tokens (sp.Decode=b',') 47 with [32706]
Duplicate tokens (sp.Decode=b'-') 48 with [32723]
Duplicate tokens (sp.Decode=b'.') 49 with [32705]
Duplicate tokens (sp.Decode=b'/') 50 with [32761]
Duplicate tokens (sp.Decode=b'0') 51 with [32733]
Duplicate tokens (sp.Decode=b'1') 52 with [32718]
Duplicate tokens (sp.Decode=b'2') 53 with [32728]
Duplicate tokens (sp.Decode=b'3') 54 with [32746]
Duplicate tokens (sp.Decode=b'4') 55 with [32749]
Duplicate tokens (sp.Decode=b'5') 56 with [32750]
Duplicate tokens (sp.Decode=b'6') 57 with [32752]
Duplicate tokens (sp.Decode=b'7') 58 with [32753]
Duplicate tokens (sp.Decode=b'8') 59 with [32751]
Duplicate tokens (sp.Decode=b'9') 60 with [32741]
Duplicate tokens (sp.Decode=b':') 61 with [32737]
Duplicate tokens (sp.Decode=b';') 62 with [32754]
Duplicate tokens (sp.Decode=b'?') 66 with [32738]
Duplicate tokens (sp.Decode=b'A') 68 with [32714]
Duplicate tokens (sp.Decode=b'B') 69 with [32722]
Duplicate tokens (sp.Decode=b'C') 70 with [32719]
Duplicate tokens (sp.Decode=b'D') 71 with [32730]
Duplicate tokens (sp.Decode=b'E') 72 with [32726]
Duplicate tokens (sp.Decode=b'F') 73 with [32734]
Duplicate tokens (sp.Decode=b'G') 74 with [32735]
Duplicate tokens (sp.Decode=b'H') 75 with [32716]
Duplicate tokens (sp.Decode=b'I') 76 with [32712]
Duplicate tokens (sp.Decode=b'J') 77 with [32744]
Duplicate tokens (sp.Decode=b'K') 78 with [32755]
Duplicate tokens (sp.Decode=b'L') 79 with [32729]
Duplicate tokens (sp.Decode=b'M') 80 with [32720]
Duplicate tokens (sp.Decode=b'N') 81 with [32731]
Duplicate tokens (sp.Decode=b'O') 82 with [32739]
Duplicate tokens (sp.Decode=b'P') 83 with [32727]
Duplicate tokens (sp.Decode=b'Q') 84 with [32763]
Duplicate tokens (sp.Decode=b'R') 85 with [32732]
Duplicate tokens (sp.Decode=b'S') 86 with [32715]
Duplicate tokens (sp.Decode=b'T') 87 with [32713]
Duplicate tokens (sp.Decode=b'U') 88 with [32757]
Duplicate tokens (sp.Decode=b'V') 89 with [32758]
Duplicate tokens (sp.Decode=b'W') 90 with [32724]
Duplicate tokens (sp.Decode=b'Y') 92 with [32748]
Duplicate tokens (sp.Decode=b'Z') 93 with [32764]
Duplicate tokens (sp.Decode=b'[') 94 with [32767]
Duplicate tokens (sp.Decode=b']') 96 with [32766]
Duplicate tokens (sp.Decode=b'_') 98 with [32717]
Duplicate tokens (sp.Decode=b'a') 100 with [32688]
Duplicate tokens (sp.Decode=b'b') 101 with [32707]
Duplicate tokens (sp.Decode=b'c') 102 with [32698]
Duplicate tokens (sp.Decode=b'd') 103 with [32696]
Duplicate tokens (sp.Decode=b'e') 104 with [32686]
Duplicate tokens (sp.Decode=b'f') 105 with [32701]
Duplicate tokens (sp.Decode=b'g') 106 with [32700]
Duplicate tokens (sp.Decode=b'h') 107 with [32694]
Duplicate tokens (sp.Decode=b'i') 108 with [32691]
Duplicate tokens (sp.Decode=b'j') 109 with [32736]
Duplicate tokens (sp.Decode=b'k') 110 with [32709]
Duplicate tokens (sp.Decode=b'l') 111 with [32695]
Duplicate tokens (sp.Decode=b'm') 112 with [32699]
Duplicate tokens (sp.Decode=b'n') 113 with [32690]
Duplicate tokens (sp.Decode=b'o') 114 with [32689]
Duplicate tokens (sp.Decode=b'p') 115 with [32704]
Duplicate tokens (sp.Decode=b'q') 116 with [32745]
Duplicate tokens (sp.Decode=b'r') 117 with [32693]
Duplicate tokens (sp.Decode=b's') 118 with [32692]
Duplicate tokens (sp.Decode=b't') 119 with [32687]
Duplicate tokens (sp.Decode=b'u') 120 with [32697]
Duplicate tokens (sp.Decode=b'v') 121 with [32708]
Duplicate tokens (sp.Decode=b'w') 122 with [32702]
Duplicate tokens (sp.Decode=b'x') 123 with [32721]
Duplicate tokens (sp.Decode=b'y') 124 with [32703]
Duplicate tokens (sp.Decode=b'z') 125 with [32740]
Duplicate tokens (sp.Decode=b'|') 127 with [32762]
Duplicate tokens (sp.Decode=b'\xef\xbf\xbd') 131 with [132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258]
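
A list in the format above could be produced by grouping token IDs by their decoded string. This is a sketch, not the script actually used for the issue; `find_duplicate_decodings` and `format_report` are hypothetical names, and `decode_one` is any callable that maps a single token ID to its decoded string.

```python
from collections import defaultdict

def find_duplicate_decodings(decode_one, vocab_size):
    """Group token IDs by decoded string and keep groups with more than one ID."""
    groups = defaultdict(list)
    for token_id in range(vocab_size):
        groups[decode_one(token_id)].append(token_id)
    return {text: ids for text, ids in groups.items() if len(ids) > 1}

def format_report(duplicates):
    """Render one line per group, ordered by the lowest ID in the group."""
    lines = []
    for text, ids in sorted(duplicates.items(), key=lambda kv: kv[1][0]):
        shown = text.encode("utf-8")
        lines.append(f"Duplicate tokens (sp.Decode={shown!r}) {ids[0]} with {ids[1:]}")
    return lines

# Toy decoder standing in for a real model; every other ID decodes uniquely.
toy = {35: " ", 32685: " ", 131: "\ufffd", 132: "\ufffd"}
for line in format_report(find_duplicate_decodings(lambda i: toy.get(i, f"tok{i}"), 32768)):
    print(line)
```

With the model from the issue, `decode_one` would be `lambda i: sp.Decode([i])`.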

Contributor guide

Tech stack: python, cpp
Domain: data, machine learning
Issue type: bug
Difficulty: 3
Estimated time: 1-3 hours
Activity status: fresh
Clarity: mostly clear
Prerequisites: Basic understanding of BPE tokenization; familiarity with sentencepiece library
Newbie friendliness: 20
Research direction: Investigate the root cause of duplicate tokens in the BPE vocabulary. Examine the byte fallback option and how it interacts with normalization rules. Compare the two sets of duplicate tokens (low IDs vs high IDs) and determine if they originate from different stages of training. Review the tokenization code in src/ to understand how pieces are merged and how duplicates can arise.
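
One way to start on the byte fallback part of this investigation is to look at piece strings rather than decoded strings. The sketch below assumes that byte fallback pieces are named `<0xNN>` and sit directly after the special and user-defined symbols. That layout is not confirmed in this thread, but it is consistent with the reported IDs: with pad=0, unk=1 and `<D>`=2, byte 0x20 (space) would land at ID 35 and byte 0x80 at ID 131. The helper names are hypothetical.

```python
import re

# Assumed naming scheme for byte fallback pieces, e.g. "<0x41>" for byte 0x41.
BYTE_PIECE = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")

def byte_value(piece):
    """Return the byte value encoded in a piece such as '<0x41>', else None."""
    match = BYTE_PIECE.match(piece)
    return int(match.group(1), 16) if match else None

def lone_byte_decoding(value):
    """Decode a single byte as UTF-8, substituting U+FFFD when it is invalid."""
    return bytes([value]).decode("utf-8", errors="replace")

# Bytes 0x00-0x7F are valid one-byte UTF-8 sequences; 0x80-0xFF are not valid
# on their own, so each of them decodes to the replacement character.
replaced = [b for b in range(256) if lone_byte_decoding(b) == "\ufffd"]
print(len(replaced), hex(replaced[0]), hex(replaced[-1]))
```

If `byte_value(sp.IdToPiece(i))` returns a value for the low-ID member of each pair and `None` for the high-ID member, the low-ID group would be the byte fallback pieces and the high-ID group the ordinary single-character pieces learned from the data.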