Comment by pxc

8 hours ago

That's a really interesting point about Ancient Chinese and other ancient scripts. I'd love to learn more about that.

I'm also more curious than I've ever been about how tokenizers for LLMs work, for both Chinese and English. I think I'll need to look at some concrete examples to really understand them, since a token can be a whole word, a single character, or a chunk somewhere in between.
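To make those three granularities concrete, here's a toy sketch in Python. Everything in it is made up for illustration: the function names, the tiny TOY_VOCAB, and the greedy longest-match rule are my own assumptions, not how any particular LLM tokenizer works. As far as I understand, real tokenizers learn their vocabulary from data and often operate on bytes rather than characters.

```python
def word_tokens(text):
    """Word-level: split on whitespace only."""
    return text.split()


def char_tokens(text):
    """Character-level: one token per non-space character."""
    return [ch for ch in text if not ch.isspace()]


def subword_tokens(text, vocab):
    """In-between: greedy longest match against a toy vocabulary.

    Falls back to a single character when no vocabulary entry matches.
    """
    tokens = []
    for word in text.split():
        i = 0
        while i < len(word):
            # Try the longest candidate first, shrinking down to one character.
            for j in range(len(word), i, -1):
                if word[i:j] in vocab or j == i + 1:
                    tokens.append(word[i:j])
                    i = j
                    break
    return tokens


# Hypothetical vocabulary, chosen by hand just for this example.
TOY_VOCAB = {"token", "ization", "un", "break", "able", "喜欢", "学习"}

print(word_tokens("unbreakable tokenization"))
print(char_tokens("我喜欢学习"))
print(subword_tokens("unbreakable tokenization", TOY_VOCAB))
print(subword_tokens("我喜欢学习", TOY_VOCAB))
```

One thing this makes obvious: splitting on whitespace does nothing useful for the Chinese sentence, because there are no spaces between words, so the whole sentence comes back as one "word". The subword version splits it into 我 / 喜欢 / 学习 only because I put those chunks in the vocabulary by hand.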