Comment by nl
9 hours ago
Tokenizers used to be one character per token. Then Google implemented subword encoding[1] in its early neural translation work and found it worked much better.
Subword units are genuinely meaningful in most languages. You do need to tune the vocabulary size, though.
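For a concrete picture of that tuning knob, here's a minimal sketch of training a subword (BPE) tokenizer with the Hugging Face tokenizers library. This isn't what Google used internally; corpus.txt and the vocab_size of 8000 are placeholder choices, not recommendations:

    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.trainers import BpeTrainer
    from tokenizers.pre_tokenizers import Whitespace

    # Build a BPE tokenizer; vocabulary size is the main knob to tune.
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()

    # Too small a vocab_size and words shatter into near-characters;
    # too large and rare whole words crowd out useful subword units.
    trainer = BpeTrainer(vocab_size=8000, special_tokens=["[UNK]"])

    # corpus.txt is a placeholder path for whatever training text you have.
    tokenizer.train(files=["corpus.txt"], trainer=trainer)

    print(tokenizer.encode("unbelievably untokenizable").tokens)
    # e.g. something like ['un', 'believ', 'ably', 'un', 'token', 'iz', 'able']

Sweeping vocab_size and checking average tokens per word on held-out text is a reasonable way to pick a value for a given language and corpus.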