Description
Hi,
we're observing an "issue" with the sentencepiece tokenizer, where multiple tokens decode to the identical string.
We generated a vocab of size 32768 from the wikitext-103 dataset with the following code:
import sentencepiece as spm
spm.SentencePieceTrainer.Train(
" --input=wikitext-103-raw/wiki.train.raw" +
" --model_prefix=wiki_32768" +
" --model_type=bpe" +
" --vocab_size=32768" +
" --hard_vocab_limit=True" +
" --input_sentence_size=2000000" +
" --unk_id=1" +
" --bos_id=-1" +
" --eos_id=-1" +
" --pad_id=0" +
" --max_sentencepiece_length=99" +
" --split_by_unicode_script=True" +
" --split_by_number=True" +
" --split_by_whitespace=True" +
" --add_dummy_prefix=False" +
" --byte_fallback=True" +
" --remove_extra_whitespaces=False" +
" --allow_whitespace_only_pieces=True" +
" --normalization_rule_name=identity" +
" --user_defined_symbols=<D>" +
" --split_digits=True" +
" --vocabulary_output_piece_score=False")
After this we run the following inspection:
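# Load the model written by the training call above
sp = spm.SentencePieceProcessor(model_file="wiki_32768.model")
for i in range(32768):
    print(i, bytes(sp.Decode([i]), 'utf-8'))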
Here we observe 2 "issues":
- 80 pairs of tokens decode to the same single-character string (see the full list below). These tokens fall into 2 sets, each contiguous in ID space: the 1st set has low IDs (almost at the start of the vocab) and the 2nd set has very high IDs (almost at the end of the vocab). In the tokenized dataset, the tokenizer "decides" to always use only one token of each pair, usually the one with the higher ID.
- 128 tokens decode to the same b'\xef\xbf\xbd' unknown symbol. b'\xef\xbf\xbd' seems to be the Unicode replacement character. But we don't understand why there would be one since we use --split_by_unicode_script=True. Or maybe those tokens are due to the --byte_fallback=True option?
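The --byte_fallback=True hypothesis can be checked against the IDs in this report without loading the model. The sketch below assumes (inferred from the listed IDs, not confirmed from the library source) that the vocab starts with pad=0, unk=1, <D>=2 and is followed by 256 byte pieces <0x00>..<0xFF> at IDs 3 to 258. `BYTE_PIECE_OFFSET` and both helper names are hypothetical. Under that assumption, a lone byte >= 0x80 is not valid UTF-8 on its own and decodes to U+FFFD, while a byte < 0x80 decodes to the same ASCII character as the regular single-character piece.

```python
# Assumption: pad=0, unk=1, <D>=2, then byte pieces <0x00>..<0xFF> at IDs 3..258.
BYTE_PIECE_OFFSET = 3


def decode_lone_byte(value: int) -> bytes:
    """Decode a single byte as UTF-8, replacing invalid input with U+FFFD."""
    return bytes([value]).decode("utf-8", errors="replace").encode("utf-8")


def hypothetical_byte_piece_id(value: int) -> int:
    """ID a byte piece would have under the assumed vocab layout."""
    return BYTE_PIECE_OFFSET + value


# Bytes that cannot stand alone as UTF-8 all render as b'\xef\xbf\xbd'.
replacement_ids = [
    hypothetical_byte_piece_id(b)
    for b in range(256)
    if decode_lone_byte(b) == b"\xef\xbf\xbd"
]
print(len(replacement_ids), replacement_ids[0], replacement_ids[-1])
# → 128 131 258

# ASCII bytes decode to themselves, so they collide with the regular pieces.
print(hypothetical_byte_piece_id(ord(" ")), hypothetical_byte_piece_id(ord("A")))
# → 35 68
```

These numbers line up with the report: 128 tokens at IDs 131 to 258 decode to the replacement character, and the low-ID duplicates for ' ' and 'A' sit at 35 and 68. If the assumption holds, the low-ID tokens would be byte pieces used only as a fallback for characters missing from the vocab, which could explain why the higher-ID token is the one seen in the tokenized data.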
Are both of these known and expected behaviors? To us it seems counterintuitive and wasteful that the vocab should use 2 token encodings to encode the same symbol.
Thanks!
List of duplicate tokens:
b'X' - stands for the string decoding (result of the sp.Decode([i]) operation)
A with [B] means that the sp.decode result for tokens A and B is identical.
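For anyone who wants to reproduce a listing in this format, here is a minimal sketch of one way to group token IDs by their decoded bytes. The helper names `find_duplicate_decodings` and `format_duplicates` are hypothetical, and the decoder is passed in as a callable so the grouping logic can be tried without a trained model.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def find_duplicate_decodings(
    decode_one: Callable[[int], str], vocab_size: int
) -> Dict[bytes, List[int]]:
    """Map each decoded byte string to the token IDs that produce it,
    keeping only strings produced by more than one ID."""
    groups: Dict[bytes, List[int]] = defaultdict(list)
    for token_id in range(vocab_size):
        groups[decode_one(token_id).encode("utf-8")].append(token_id)
    return {text: ids for text, ids in groups.items() if len(ids) > 1}


def format_duplicates(duplicates: Dict[bytes, List[int]]) -> List[str]:
    """Render groups as 'A with [B, ...]' lines, lowest ID first."""
    return [
        f"Duplicate tokens (sp.Decode={text!r}) {ids[0]} with {ids[1:]}"
        for text, ids in duplicates.items()
    ]


# Usage with the model from this report (needs sentencepiece and the model file):
# import sentencepiece as spm
# sp = spm.SentencePieceProcessor(model_file="wiki_32768.model")
# dups = find_duplicate_decodings(lambda i: sp.Decode([i]), 32768)
# print("\n".join(format_duplicates(dups)))
```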
Duplicate tokens (sp.Decode=b' ') 35 with [32685]
Duplicate tokens (sp.Decode=b'!') 36 with [32760]
Duplicate tokens (sp.Decode=b'"') 37 with [32710]
Duplicate tokens (sp.Decode=b'#') 38 with [32759]
Duplicate tokens (sp.Decode=b"'") 42 with [32711]
Duplicate tokens (sp.Decode=b'(') 43 with [32743]
Duplicate tokens (sp.Decode=b')') 44 with [32742]
Duplicate tokens (sp.Decode=b'*') 45 with [32725]
Duplicate tokens (sp.Decode=b',') 47 with [32706]
Duplicate tokens (sp.Decode=b'-') 48 with [32723]
Duplicate tokens (sp.Decode=b'.') 49 with [32705]
Duplicate tokens (sp.Decode=b'/') 50 with [32761]
Duplicate tokens (sp.Decode=b'0') 51 with [32733]
Duplicate tokens (sp.Decode=b'1') 52 with [32718]
Duplicate tokens (sp.Decode=b'2') 53 with [32728]
Duplicate tokens (sp.Decode=b'3') 54 with [32746]
Duplicate tokens (sp.Decode=b'4') 55 with [32749]
Duplicate tokens (sp.Decode=b'5') 56 with [32750]
Duplicate tokens (sp.Decode=b'6') 57 with [32752]
Duplicate tokens (sp.Decode=b'7') 58 with [32753]
Duplicate tokens (sp.Decode=b'8') 59 with [32751]
Duplicate tokens (sp.Decode=b'9') 60 with [32741]
Duplicate tokens (sp.Decode=b':') 61 with [32737]
Duplicate tokens (sp.Decode=b';') 62 with [32754]
Duplicate tokens (sp.Decode=b'?') 66 with [32738]
Duplicate tokens (sp.Decode=b'A') 68 with [32714]
Duplicate tokens (sp.Decode=b'B') 69 with [32722]
Duplicate tokens (sp.Decode=b'C') 70 with [32719]
Duplicate tokens (sp.Decode=b'D') 71 with [32730]
Duplicate tokens (sp.Decode=b'E') 72 with [32726]
Duplicate tokens (sp.Decode=b'F') 73 with [32734]
Duplicate tokens (sp.Decode=b'G') 74 with [32735]
Duplicate tokens (sp.Decode=b'H') 75 with [32716]
Duplicate tokens (sp.Decode=b'I') 76 with [32712]
Duplicate tokens (sp.Decode=b'J') 77 with [32744]
Duplicate tokens (sp.Decode=b'K') 78 with [32755]
Duplicate tokens (sp.Decode=b'L') 79 with [32729]
Duplicate tokens (sp.Decode=b'M') 80 with [32720]
Duplicate tokens (sp.Decode=b'N') 81 with [32731]
Duplicate tokens (sp.Decode=b'O') 82 with [32739]
Duplicate tokens (sp.Decode=b'P') 83 with [32727]
Duplicate tokens (sp.Decode=b'Q') 84 with [32763]
Duplicate tokens (sp.Decode=b'R') 85 with [32732]
Duplicate tokens (sp.Decode=b'S') 86 with [32715]
Duplicate tokens (sp.Decode=b'T') 87 with [32713]
Duplicate tokens (sp.Decode=b'U') 88 with [32757]
Duplicate tokens (sp.Decode=b'V') 89 with [32758]
Duplicate tokens (sp.Decode=b'W') 90 with [32724]
Duplicate tokens (sp.Decode=b'Y') 92 with [32748]
Duplicate tokens (sp.Decode=b'Z') 93 with [32764]
Duplicate tokens (sp.Decode=b'[') 94 with [32767]
Duplicate tokens (sp.Decode=b']') 96 with [32766]
Duplicate tokens (sp.Decode=b'_') 98 with [32717]
Duplicate tokens (sp.Decode=b'a') 100 with [32688]
Duplicate tokens (sp.Decode=b'b') 101 with [32707]
Duplicate tokens (sp.Decode=b'c') 102 with [32698]
Duplicate tokens (sp.Decode=b'd') 103 with [32696]
Duplicate tokens (sp.Decode=b'e') 104 with [32686]
Duplicate tokens (sp.Decode=b'f') 105 with [32701]
Duplicate tokens (sp.Decode=b'g') 106 with [32700]
Duplicate tokens (sp.Decode=b'h') 107 with [32694]
Duplicate tokens (sp.Decode=b'i') 108 with [32691]
Duplicate tokens (sp.Decode=b'j') 109 with [32736]
Duplicate tokens (sp.Decode=b'k') 110 with [32709]
Duplicate tokens (sp.Decode=b'l') 111 with [32695]
Duplicate tokens (sp.Decode=b'm') 112 with [32699]
Duplicate tokens (sp.Decode=b'n') 113 with [32690]
Duplicate tokens (sp.Decode=b'o') 114 with [32689]
Duplicate tokens (sp.Decode=b'p') 115 with [32704]
Duplicate tokens (sp.Decode=b'q') 116 with [32745]
Duplicate tokens (sp.Decode=b'r') 117 with [32693]
Duplicate tokens (sp.Decode=b's') 118 with [32692]
Duplicate tokens (sp.Decode=b't') 119 with [32687]
Duplicate tokens (sp.Decode=b'u') 120 with [32697]
Duplicate tokens (sp.Decode=b'v') 121 with [32708]
Duplicate tokens (sp.Decode=b'w') 122 with [32702]
Duplicate tokens (sp.Decode=b'x') 123 with [32721]
Duplicate tokens (sp.Decode=b'y') 124 with [32703]
Duplicate tokens (sp.Decode=b'z') 125 with [32740]
Duplicate tokens (sp.Decode=b'|') 127 with [32762]
Duplicate tokens (sp.Decode=b'\xef\xbf\xbd') 131 with [132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258]