
Comment by alexchamberlain

3 months ago

I'm probably one of the least educated software engineers on LLMs, so apologies if this is a very naive question. Has anyone done any research into just using words as the tokens rather than (if I understand it correctly) 2-3 characters? I understand there would be limitations with this approach, but maybe the models would be smaller overall?

Modern tokenizers are built by iteratively running frequency analysis over arbitrary-length sequences in a large corpus, so what you suggest is already close to the norm: tokens aren't fixed 2-3 character n-grams. Any sequence that is common enough, including whole words, already maps to a single token; the rarer a sequence is, the more tokens it takes to encode. That's the byte-pair encoding (BPE) algorithm:

https://en.wikipedia.org/wiki/Byte-pair_encoding
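
A minimal sketch of that merge loop in Python, roughly following the algorithm described on the Wikipedia page (the toy corpus and the merge count are made up for illustration):

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Rewrite every word, replacing each occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency; each word starts out as a sequence of characters.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(w): f for w, f in corpus.items()}

merges = []
for _ in range(10):                      # number of merges = vocabulary budget
    pairs = get_pair_counts(words)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)     # most frequent adjacent pair wins
    merges.append(best)
    words = merge_pair(words, best)

print(merges)  # learned merge rules, starting with ('e', 's'), ('es', 't'), ...
print(words)   # frequent words collapse to a single token; the rare "lower" stays split
```

With that budget, "low", "newest" and "widest" each end up as one token while the rarest word, "lower", remains split into several pieces, which is exactly the frequency behaviour described above.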

It's also not lossy compression at all; if anything it's lossless compression, contrary to what some people have claimed here: decoding the token ids gives back exactly the original text.
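
A quick round-trip check of that lossless claim; this assumes the tiktoken package is installed, and the specific library and encoding name are just one example of a byte-level BPE tokenizer:

```python
# Encoding to token ids and decoding again returns the exact original string.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Tokenization is reversible: bytes in, bytes out."
tokens = enc.encode(text)
print(tokens)                     # a short list of integer token ids
assert enc.decode(tokens) == text
```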

Shocking comments here; what happened to HN? People are so clueless it reads like Reddit.

You would need dictionaries with millions of tokens, which makes models much larger: the embedding and output layers scale with vocabulary size. Also, any word too rare to make it into the dictionary is completely unknown to your model.
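
To put rough numbers on "much larger" (the vocabulary sizes and hidden dimension below are illustrative assumptions, not figures from any particular model):

```python
hidden_dim = 4096                # hypothetical embedding width

def embedding_params(vocab_size, dim=hidden_dim):
    # input embedding table plus an untied output projection of the same shape
    return 2 * vocab_size * dim

print(f"{embedding_params(50_000):,}")     # 409,600,000  -> ~0.4B params for a 50k BPE vocab
print(f"{embedding_params(2_000_000):,}")  # 16,384,000,000 -> ~16B params for a 2M word-level vocab
```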

To add to the other commenter: the reason the dictionary would get so big is that every variation of a stem becomes its own token (cat, cats, sit, sitting, etc.). And any out-of-dictionary word or combo word, e.g. "cat bed", couldn't be represented at all; a toy sketch of that failure mode follows below.
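
Here "catbed" stands in for the "cat bed"-style combo word. The vocabularies and the greedy longest-match splitter are made up for illustration; real BPE encoders apply their learned merges in order rather than matching longest-first:

```python
# Word-level vocabulary: every surface form (cat, cats, sit, sitting, ...) needs
# its own entry, and anything not listed collapses into a single <unk> id.
word_vocab = {"<unk>": 0, "the": 1, "cat": 2, "cats": 3, "sit": 4,
              "sitting": 5, "on": 6, "a": 7, "bed": 8}

def encode_word_level(text):
    return [word_vocab.get(w, word_vocab["<unk>"]) for w in text.lower().split()]

print(encode_word_level("the cat sitting on a bed"))  # [1, 2, 5, 6, 7, 8]
print(encode_word_level("the catbed"))                # [1, 0] -- "catbed" is lost

# Subword pieces (hypothetical output of a BPE run): an unseen word can still be
# represented by stitching together known pieces, down to single characters in
# the worst case, so nothing is ever truly out-of-vocabulary.
subword_vocab = {"cat", "bed", "sit", "ting", "s", "c", "a", "t", "b", "e", "d"}

def encode_subword(word):
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # greedy longest match
            if word[i:j] in subword_vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])          # fall back to the raw character
            i += 1
    return pieces

print(encode_subword("catbed"))   # ['cat', 'bed']
print(encode_subword("sitting"))  # ['sit', 'ting']
print(encode_subword("cats"))     # ['cat', 's']
```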