Comment by nextaccountic
6 days ago
Isn't giving this word its own token deeply wasteful, when some more common things take multiple tokens?
Indeed, how do they deal with Chinese? Are some ideograms multiple tokens?
6 days ago
It simply means the tokenizer's training corpus may have included a massive amount of German literature or accidentally oversampled a web page where that word was frequently repeated. Look up "glitch tokens" to learn more.
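To make the oversampling point concrete, here is a toy sketch of byte-pair-encoding (BPE) training. It assumes a character-level BPE over a made-up corpus; the word "xyzzy", the counts, and the function names are all hypothetical, and real tokenizers work on bytes with far larger corpora. It only illustrates how a string repeated often enough earns its own token while an ordinary word stays split.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def train_bpe(words, num_merges):
    """Learn `num_merges` merges by repeatedly fusing the most frequent adjacent pair."""
    vocab = Counter(tuple(w) for w in words)  # each word starts as single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(merge_pair(list(symbols), best))] += freq
        vocab = new_vocab
    return merges

def tokenize(word, merges):
    """Apply the learned merges, in order, to a new word."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

# Hypothetical corpus: one odd string scraped 1000 times, one ordinary word 10 times.
corpus = ["xyzzy"] * 1000 + ["hello"] * 10
merges = train_bpe(corpus, num_merges=4)

print(tokenize("xyzzy", merges))  # ['xyzzy']
print(tokenize("hello", merges))  # ['h', 'e', 'l', 'l', 'o']
```

With a budget of only four merges, all of them go to the oversampled string, so it ends up as a single token and the ordinary word is left as five. The same mechanism applies to Chinese in byte-level tokenizers: a character such as "中" is three UTF-8 bytes, and it becomes one token only if those bytes were merged during training.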