Comment by pxc
8 hours ago
That's a really interesting point about Ancient Chinese and other ancient scripts. I'd love to learn more about that.
I'm also more curious than I've ever been about tokenizers for LLMs, for both Chinese and English. I think I'll need to look at some concrete examples to really understand them, since tokens can be whole words, single characters, or chunks somewhere in between.
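To make the three granularities concrete, here's a toy sketch in Python. Everything in it is an assumption for illustration: the function names, the tiny `TOY_VOCAB`, and the greedy longest-match rule are made up, and real LLM tokenizers learn their vocabularies from data (often at the byte level) rather than using a hand-written set like this.

```python
def word_tokens(text):
    """Per-word: split on whitespace (only useful for scripts that use spaces)."""
    return text.split()


def char_tokens(text):
    """Per-character: every Unicode character is its own token."""
    return list(text)


def subword_tokens(text, vocab):
    """In between: greedy longest match against a toy vocabulary.

    Characters not covered by any vocabulary entry fall back to
    single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate starting at position i first.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary entry matched: emit one character.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical hand-picked vocabulary, not taken from any real model.
TOY_VOCAB = {"token", "ization", "izer", "s", "分词", "器"}

print(word_tokens("tokenization of tokenizers"))
print(char_tokens("分词器"))
print(subword_tokens("tokenization", TOY_VOCAB))
print(subword_tokens("tokenizers", TOY_VOCAB))
print(subword_tokens("分词器", TOY_VOCAB))
```

With this toy vocabulary, "tokenization" comes out as two chunks and "tokenizers" as three, while "分词器" is one character per token at the character level and two chunks at the subword level. Whitespace splitting does nothing useful for the Chinese string, which is part of why the in-between approach is interesting for both languages.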