Comment by lyu07282
3 months ago
Modern tokenizers are built by iteratively running frequency analysis over arbitrary-length sequences in a large corpus, so what you suggested is already the norm; tokens aren't fixed n-grams. Any word, or really any sequence, that is common enough becomes a single token, and the less frequent a sequence is, the more tokens it needs. That's the byte-pair encoding (BPE) algorithm:
https://en.wikipedia.org/wiki/Byte-pair_encoding
It's also not lossy compression at all; if anything it's lossless compression, contrary to what some people have claimed here.
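Here's a minimal toy sketch of the BPE merge loop, just to illustrate the idea (character-level over a tiny word list; real tokenizers operate on bytes over huge corpora):

```python
from collections import Counter

def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    # Start from individual characters; repeatedly merge the most frequent adjacent pair.
    words = [list(w) for w in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        merged_words = []
        for w in words:
            new_w, i = [], 0
            while i < len(w):
                # Apply the chosen merge wherever the pair occurs.
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    new_w.append(w[i] + w[i + 1])
                    i += 2
                else:
                    new_w.append(w[i])
                    i += 1
            merged_words.append(new_w)
        words = merged_words
    return merges

corpus = ["low", "lower", "lowest", "low", "low"]
print(train_bpe(corpus, 3))  # frequent sequences like "lo" and "low" become single tokens
```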
Shocking comments here, what happened to HN? People are so clueless it reads like reddit wtf
Thanks, that's really interesting. Do they correct for spelling mistakes or internationalised spellings? For example, do `colour` and `color` end up in the same token stream?
No, it just looks at exact character sequences; try it out yourself here: https://platform.openai.com/tokenizer
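For example, a quick check with the tiktoken library (assuming `pip install tiktoken`; it implements the same encodings as the web tokenizer above):

```python
# The two spellings are different character sequences, so BPE gives them different token ids.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
print(enc.encode("color"))   # token id(s) for "color"
print(enc.encode("colour"))  # different token id(s) -- spellings are not normalised
```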