Perhaps the word does have its own token: " geschniegelt" (geschniegelt with a space in front of it) is token 192786 in the tokenizer that GPT-5 apparently uses.
It simply means the tokenizer's training corpus may have included a massive amount of German literature or accidentally oversampled a web page where that word was frequently repeated. Look up "glitch tokens" to learn more.
https://raw.githubusercontent.com/niieani/gpt-tokenizer/refs...
Isn't giving this word its own token deeply wasteful, when some more common strings take multiple tokens?
Indeed, how do they deal with Chinese? Are some ideograms multiple tokens?
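On the ideogram question: OpenAI's tokenizers are byte-level BPE, so every character starts out as its UTF-8 bytes (three for a typical CJK ideogram) and only becomes a single token if a merge for it was learned from the training corpus. Rare characters can therefore cost two or three tokens each. A minimal sketch of the byte counts involved:

```python
# Each CJK ideogram is 3 bytes in UTF-8. A byte-level BPE starts from
# these bytes; common characters get merged into one token, while rare
# ones may remain as multiple byte-level tokens.
for ch in "中文字":
    print(ch, "->", len(ch.encode("utf-8")), "UTF-8 bytes")
```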
Based on their tokenizer tool[1], for GPT 5.x "geschniegelt" is tokenized into three tokens.
[1]: https://platform.openai.com/tokenizer
It's a single token in the most common usage, that is, with a space in front of it:

"This word is geschniegelt" is [2500, 2195, 382, 192786]

The last token here is " geschniegelt".
Maybe this is why? Most of the training data contains the single-token version, so the three-token version was undertrained?
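The leading-space effect can be illustrated with a toy greedy longest-match tokenizer. Real BPE applies learned merge rules rather than longest-match lookup, and the sub-word pieces and their IDs below are invented for illustration; only 192786 and the sentence IDs come from the thread above.

```python
# Toy longest-match tokenizer showing why " geschniegelt" (with a
# leading space) is one token while the bare word at the start of a
# string splits into pieces. Vocabulary is made up except where noted.
VOCAB = {
    " geschniegelt": 192786,  # ID from the gpt-tokenizer dump above
    "ges": 1001, "chn": 1002, "ieg": 1003, "elt": 1004,  # invented pieces
    "This": 2500, " word": 2195, " is": 382,  # IDs from the thread's example
}

def tokenize(text):
    tokens = []
    i = 0
    while i < len(text):
        # Take the longest vocabulary entry starting at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(VOCAB[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i:]!r}")
    return tokens

print(tokenize("This word is geschniegelt"))  # [2500, 2195, 382, 192786]
print(tokenize("geschniegelt"))               # [1001, 1002, 1003, 1004]
```

Because the word almost always appears mid-sentence after a space, the single token " geschniegelt" absorbs nearly all occurrences, leaving the space-free spelling rare in training.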