Comment by lyu07282
3 months ago
Modern tokenizers are built by iteratively running frequency analysis over arbitrary-length sequences in a large corpus, so what you suggested is already the norm; tokens aren't fixed n-grams. Any word, or really any sequence, that is common enough becomes a single token, and the less frequent a sequence is, the more tokens it needs. That's the byte-pair encoding (BPE) algorithm:
https://en.wikipedia.org/wiki/Byte-pair_encoding
It's also not lossy compression at all; if anything it's lossless compression, contrary to what some people have claimed here.
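Here's a minimal toy sketch of the BPE merge loop, just to illustrate the idea (character-level over a tiny word list; real tokenizers operate on bytes over huge corpora):

```python
from collections import Counter

def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    # Start from individual characters; repeatedly merge the most frequent adjacent pair.
    words = [list(w) for w in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        merged_words = []
        for w in words:
            new_w, i = [], 0
            while i < len(w):
                # Apply the chosen merge wherever the pair occurs.
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    new_w.append(w[i] + w[i + 1])
                    i += 2
                else:
                    new_w.append(w[i])
                    i += 1
            merged_words.append(new_w)
        words = merged_words
    return merges

corpus = ["low", "lower", "lowest", "low", "low"]
print(train_bpe(corpus, 3))  # frequent sequences like "lo" and "low" become single tokens
```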
Shocking comments here, what happened to HN? People are so clueless it reads like reddit wtf
Thanks, that's really interesting. Do they correct for spelling mistakes or internationalised spellings? For example, do `colour` and `color` end up in the same token stream?
No, it just looks at exact character sequences; try it out yourself here: https://platform.openai.com/tokenizer
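For example, a quick check with the tiktoken library (assuming `pip install tiktoken`; it implements the same encodings as the web tokenizer above):

```python
# The two spellings are different character sequences, so BPE gives them different token ids.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
print(enc.encode("color"))   # token id(s) for "color"
print(enc.encode("colour"))  # different token id(s) -- spellings are not normalised
```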