
Comment by kilpikaarna

6 hours ago

Wait, where can I learn more about this? I don't doubt that varying the tokenization during training improves results, but how does/would that enable token introspection?

Because LLMs can learn from training context that different token sequences represent the same character sequence, just as they learn far more complex patterns from context.
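To make the "different token sequences, same character sequence" point concrete, here's a toy sketch (a made-up vocabulary, not any real tokenizer) enumerating every valid segmentation of one string:

```python
# Toy vocabulary -- illustrative only, not a real BPE merge table.
vocab = {"h", "e", "l", "o", "he", "hell", "llo", "hello"}

def all_tokenizations(s, vocab):
    """Enumerate every way to segment s into tokens from vocab."""
    if not s:
        return [[]]
    results = []
    for i in range(1, len(s) + 1):
        prefix = s[:i]
        if prefix in vocab:
            for rest in all_tokenizations(s[i:], vocab):
                results.append([prefix] + rest)
    return results

segmentations = all_tokenizations("hello", vocab)
for seg in segmentations:
    print(seg)  # e.g. ['hello'], ['hell', 'o'], ['he', 'llo'], ...
```

Every segmentation joins back to the identical string, so a model that sees varied tokenizations of the same text during training gets direct evidence that these sequences are interchangeable at the character level.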

You can try this out locally with any mid-sized current-gen LLM. You’ll find that it can spell out most atomic tokens from its input just fine. It simply learned to do so.