Comment by orbital-decay

13 hours ago

Current models understand different tokenization variants perfectly, e.g. leading space vs. no leading space vs. one character per token. It doesn't even affect evals and benchmarks. They're also good at languages with very flexible word formation (e.g. Slavic languages) and can easily invent fairly natural non-existent words without being restricted by tokenization. This ability took a bit of a hit with recent RL and code-generation optimizations, but that's not related to tokenization.
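
To show what I mean by variants, here's a minimal sketch (using OpenAI's tiktoken library and the cl100k_base encoding; that's my choice of example, not something the thread specifies):

```python
import tiktoken

# cl100k_base is the encoding used by GPT-4-era OpenAI models
enc = tiktoken.get_encoding("cl100k_base")

for text in ["token", " token", "Token"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r:>9} -> ids={ids} pieces={pieces}")

# "token" and " token" encode to different token ids, yet models map
# both spellings onto the same underlying word without trouble.
```

The same word lands on different token ids depending on a single leading space, and models handle all of these forms interchangeably.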

>None of them realized this duality and have taken one possible interpretation.

I suspect this happens due to mode collapse and has nothing to do with tokenization. Try the same prompt with a base model.
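
A quick way to run that check (a sketch assuming the Hugging Face transformers library and GPT-2 as a stand-in base model; the ambiguous prompt from the parent comment isn't given here, so it stays a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 is a pure base model: no instruction tuning, no RLHF
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "PUT THE AMBIGUOUS PROMPT HERE"  # placeholder, not from the thread
inputs = tok(prompt, return_tensors="pt")
out = model.generate(
    **inputs,
    do_sample=True,          # sample instead of greedy decoding
    temperature=1.0,
    max_new_tokens=40,
    num_return_sequences=5,  # several samples to see the spread
    pad_token_id=tok.eos_token_id,
)
for seq in out:
    print(tok.decode(seq, skip_special_tokens=True))
```

If the base model's samples spread across both interpretations while the tuned model always picks one, that points to mode collapse from post-training rather than tokenization.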