Comment by Marazan
6 hours ago
If you remove the auxiliary tools and just leave the core LLM, then strawberry still has an undefined number of `r`s in it.
That’s false. Larger LLMs learn the character decompositions of their tokens during training, and modern training pipelines are in fact designed to occasionally emit uncommon tokenizations (including splitting words into individual characters) precisely for this reason. Frontier models have no trouble spelling words even without tools, and even many mid-sized models can do it.
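To make the idea concrete, here’s a toy sketch (not any particular lab’s pipeline; the tiktoken encoding, the word-level split, and the dropout rate are all just illustrative): with some probability you replace a word’s normal BPE encoding with a character-by-character one, so training exposes both decompositions of the same string.

```python
# Toy sketch of character-level "tokenization dropout" for training data.
# Assumes the tiktoken package; cl100k_base and the 10% rate are illustrative
# choices, not anything a specific lab has published.
import random
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def encode_with_char_dropout(text: str, p: float = 0.1) -> list[int]:
    """Encode text, occasionally spelling a word out character by character."""
    ids: list[int] = []
    for i, word in enumerate(text.split(" ")):
        piece = word if i == 0 else " " + word   # keep BPE's leading-space convention
        if random.random() < p:
            # Uncommon decomposition: encode each character on its own.
            for ch in piece:
                ids.extend(enc.encode(ch))
        else:
            ids.extend(enc.encode(piece))        # standard decomposition
    return ids

print(enc.encode("strawberry"))                       # the usual token IDs
print(encode_with_char_dropout("strawberry", p=1.0))  # forced: one token per character
print(enc.decode(encode_with_char_dropout("a ripe strawberry")))  # decodes to the same string
```

Either way the IDs decode back to the same text; seeing both forms in training is what lets the model associate a fused token with its spelling.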
Wait, where can I learn more about this? I don't doubt that varying the tokenization during training improves results, but how does (or would) that enable token introspection?
Because LLMs can learn from training data that different token sequences represent the same character sequence, just as they learn much more complex patterns from context.
You can try this out locally with any mid-sized current-gen LLM. You’ll find that it can spell out most atomic tokens from its input just fine. It simply learned to do so.
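For a concrete harness, here’s a minimal sketch of that test. It assumes an Ollama server on its default port with a model tagged llama3.1 pulled; swap in whatever local runner, endpoint, and model you actually use.

```python
# Quick local test: ask a model to spell words letter by letter and count hits.
# Sketch only; assumes an Ollama server on localhost:11434 with "llama3.1"
# available -- adjust the URL and model tag to your own setup.
import requests

WORDS = ["strawberry", "bookkeeper", "parallelism", "onomatopoeia"]

def ask(prompt: str) -> str:
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.1", "prompt": prompt, "stream": False},
        timeout=120,
    )
    r.raise_for_status()
    return r.json()["response"]

correct = 0
for word in WORDS:
    reply = ask(f"Spell the word '{word}' as capital letters separated by hyphens, nothing else.")
    spelled = reply.strip().replace("-", "").lower()
    ok = spelled == word
    correct += ok
    print(f"{word}: {reply.strip()!r} -> {'ok' if ok else 'miss'}")

print(f"{correct}/{len(WORDS)} spelled correctly")
```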