Comment by WatchDog
6 days ago
Perhaps. The word does have its own token: " geschniegelt" (with a leading space) is token 192786 in the tokenizer that GPT-5 apparently uses.
https://raw.githubusercontent.com/niieani/gpt-tokenizer/refs...
Isn't giving this word its own token deeply wasteful, when more common strings take multiple tokens?
Indeed, how do they deal with Chinese? Are some ideograms multiple tokens?
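For context on the Chinese question: modern GPT tokenizers are byte-level BPEs, so every string is first mapped to UTF-8 bytes (a CJK ideogram is 3 bytes), and only frequently seen byte sequences get merged into single tokens. A minimal sketch with a made-up "common characters" vocabulary (the real merge tables are learned, not hand-picked):

```python
# Sketch of byte-level tokenization for CJK text. The `merged` set below is a
# toy stand-in for characters whose byte sequences got merged during BPE
# training; it is NOT the real o200k vocabulary.

def utf8_bytes(text: str) -> list[int]:
    """Every string is first viewed as UTF-8 bytes; CJK ideograms take 3 each."""
    return list(text.encode("utf-8"))

merged = {"中", "国", "的"}  # hypothetical "common" ideograms with their own token

def toy_tokenize(text: str) -> list[str]:
    tokens = []
    for ch in text:
        if ch in merged:
            tokens.append(ch)  # common character: one token for the whole glyph
        else:
            # rare character: fall back to its raw byte tokens
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens

print(len(utf8_bytes("中")))  # → 3
print(toy_tokenize("中国"))   # common ideograms: one token each
print(toy_tokenize("籱"))     # rare ideogram: three byte tokens
```

So the answer is yes: common ideograms typically get one token, while rare ones can cost several byte-level tokens each.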
It likely just means the tokenizer's training corpus included a large amount of German text, or accidentally oversampled a web page where that word was repeated many times. Look up "glitch tokens" to learn more.
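The oversampling effect is easy to reproduce with a toy BPE trainer: if one word dominates the corpus, the frequency-based merges fuse it into a single token while ordinary words stay split. The corpus and merge count below are made up for illustration:

```python
# Toy byte-pair-encoding trainer: repeatedly merge the most frequent adjacent
# symbol pair across the whole corpus. Illustrative only, not the real algorithm
# configuration used for any production tokenizer.
from collections import Counter

def train_bpe(words, num_merges):
    seqs = [list(w) for w in words]  # each word starts as a character sequence
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]  # most frequent adjacent pair
        merges.append((a, b))
        for seq in seqs:  # apply the merge everywhere
            i = 0
            while i < len(seq) - 1:
                if seq[i] == a and seq[i + 1] == b:
                    seq[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges, seqs

# An "oversampled" corpus: one rare word repeated far more than everything else.
corpus = ["geschniegelt"] * 50 + ["the", "cat", "sat"] * 5
merges, seqs = train_bpe(corpus, num_merges=10)
print(seqs[0])   # the oversampled word has been fused into a single token
print(seqs[50])  # ordinary words are still split into pieces
```

Every merge step picks a pair from the repeated word (count 50+) over pairs from the other words (count 5 or 10), so ten merges are enough to give "geschniegelt" its own token.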