Comment by WatchDog
6 days ago
Perhaps. The word does have its own token: " geschniegelt" (with a leading space) is token 192786 in the tokenizer that GPT-5 apparently uses.
https://raw.githubusercontent.com/niieani/gpt-tokenizer/refs...
Isn't giving this word its own token deeply wasteful, when more common strings take multiple tokens?
Indeed, how do they deal with Chinese? Are some ideograms multiple tokens?
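For context on the Chinese question: modern GPT tokenizers are byte-level BPEs, so every string is first mapped to UTF-8 bytes (a CJK ideogram is 3 bytes), and only frequently seen byte sequences get merged into single tokens. A minimal sketch with a made-up "common characters" vocabulary (the real merge tables are learned, not hand-picked):

```python
# Sketch of byte-level tokenization for CJK text. The `merged` set below is a
# toy stand-in for characters whose byte sequences got merged during BPE
# training; it is NOT the real o200k vocabulary.

def utf8_bytes(text: str) -> list[int]:
    """Every string is first viewed as UTF-8 bytes; CJK ideograms take 3 each."""
    return list(text.encode("utf-8"))

merged = {"中", "国", "的"}  # hypothetical "common" ideograms with their own token

def toy_tokenize(text: str) -> list[str]:
    tokens = []
    for ch in text:
        if ch in merged:
            tokens.append(ch)  # common character: one token for the whole glyph
        else:
            # rare character: fall back to its raw byte tokens
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens

print(len(utf8_bytes("中")))  # → 3
print(toy_tokenize("中国"))   # common ideograms: one token each
print(toy_tokenize("籱"))     # rare ideogram: three byte tokens
```

So the answer is yes: common ideograms typically get one token, while rare ones can cost several byte-level tokens each.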
It likely just means the tokenizer's training corpus included a large amount of German text, or accidentally oversampled a web page where that word was repeated many times. Look up "glitch tokens" to learn more.
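The oversampling effect is easy to reproduce with a toy BPE trainer: if one word dominates the corpus, the frequency-based merges fuse it into a single token while ordinary words stay split. The corpus and merge count below are made up for illustration:

```python
# Toy byte-pair-encoding trainer: repeatedly merge the most frequent adjacent
# symbol pair across the whole corpus. Illustrative only, not the real algorithm
# configuration used for any production tokenizer.
from collections import Counter

def train_bpe(words, num_merges):
    seqs = [list(w) for w in words]  # each word starts as a character sequence
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]  # most frequent adjacent pair
        merges.append((a, b))
        for seq in seqs:  # apply the merge everywhere
            i = 0
            while i < len(seq) - 1:
                if seq[i] == a and seq[i + 1] == b:
                    seq[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges, seqs

# An "oversampled" corpus: one rare word repeated far more than everything else.
corpus = ["geschniegelt"] * 50 + ["the", "cat", "sat"] * 5
merges, seqs = train_bpe(corpus, num_merges=10)
print(seqs[0])   # the oversampled word has been fused into a single token
print(seqs[50])  # ordinary words are still split into pieces
```

Every merge step picks a pair from the repeated word (count 50+) over pairs from the other words (count 5 or 10), so ten merges are enough to give "geschniegelt" its own token.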