
Comment by alexchamberlain

3 months ago

I'm probably one of the least educated software engineers on LLMs, so apologies if this is a very naive question. Has anyone done any research into just using words as the tokens rather than (if I understand it correctly) 2-3 characters? I understand there would be limitations with this approach, but maybe the models would be smaller overall?

Modern tokenizers are built by iteratively running frequency analysis over arbitrary-length sequences in a large corpus, so what you suggest is already close to the norm: tokens aren't fixed 2-3 character n-grams. Any sequence that is common enough, including whole words, already maps to a single token; the rarer a sequence is, the more tokens it takes to encode. That's the byte-pair encoding (BPE) algorithm:

https://en.wikipedia.org/wiki/Byte-pair_encoding
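
A minimal sketch of that merge loop in Python, roughly following the algorithm described on the Wikipedia page (the toy corpus and the merge count are made up for illustration):

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Rewrite every word, replacing each occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency; each word starts out as a sequence of characters.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(w): f for w, f in corpus.items()}

merges = []
for _ in range(10):                      # number of merges = vocabulary budget
    pairs = get_pair_counts(words)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)     # most frequent adjacent pair wins
    merges.append(best)
    words = merge_pair(words, best)

print(merges)  # learned merge rules, starting with ('e', 's'), ('es', 't'), ...
print(words)   # frequent words collapse to a single token; the rare "lower" stays split
```

With that budget, "low", "newest" and "widest" each end up as one token while the rarest word, "lower", remains split into several pieces, which is exactly the frequency behaviour described above.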

It's also not lossy compression at all; if anything it's lossless compression, contrary to what some people have claimed here: decoding the token ids gives back exactly the original text.
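
A quick round-trip check of that lossless claim; this assumes the tiktoken package is installed, and the specific library and encoding name are just one example of a byte-level BPE tokenizer:

```python
# Encoding to token ids and decoding again returns the exact original string.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Tokenization is reversible: bytes in, bytes out."
tokens = enc.encode(text)
print(tokens)                     # a short list of integer token ids
assert enc.decode(tokens) == text
```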

Shocking comments here; what happened to HN? People are so clueless it reads like Reddit.

You would need dictionaries with millions of tokens, which makes models much larger: the embedding and output layers scale with vocabulary size. Also, any word too rare to make it into the dictionary is completely unknown to your model.
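
To put rough numbers on "much larger" (the vocabulary sizes and hidden dimension below are illustrative assumptions, not figures from any particular model):

```python
hidden_dim = 4096                # hypothetical embedding width

def embedding_params(vocab_size, dim=hidden_dim):
    # input embedding table plus an untied output projection of the same shape
    return 2 * vocab_size * dim

print(f"{embedding_params(50_000):,}")     # 409,600,000  -> ~0.4B params for a 50k BPE vocab
print(f"{embedding_params(2_000_000):,}")  # 16,384,000,000 -> ~16B params for a 2M word-level vocab
```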

To add to the other commenter: the reason the dictionary would get so big is that every variation of a stem becomes its own token (cat, cats, sit, sitting, etc.). And any out-of-dictionary word or combo word, e.g. "cat bed", couldn't be represented at all; a toy sketch of that failure mode follows below.
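
Here "catbed" stands in for the "cat bed"-style combo word. The vocabularies and the greedy longest-match splitter are made up for illustration; real BPE encoders apply their learned merges in order rather than matching longest-first:

```python
# Word-level vocabulary: every surface form (cat, cats, sit, sitting, ...) needs
# its own entry, and anything not listed collapses into a single <unk> id.
word_vocab = {"<unk>": 0, "the": 1, "cat": 2, "cats": 3, "sit": 4,
              "sitting": 5, "on": 6, "a": 7, "bed": 8}

def encode_word_level(text):
    return [word_vocab.get(w, word_vocab["<unk>"]) for w in text.lower().split()]

print(encode_word_level("the cat sitting on a bed"))  # [1, 2, 5, 6, 7, 8]
print(encode_word_level("the catbed"))                # [1, 0] -- "catbed" is lost

# Subword pieces (hypothetical output of a BPE run): an unseen word can still be
# represented by stitching together known pieces, down to single characters in
# the worst case, so nothing is ever truly out-of-vocabulary.
subword_vocab = {"cat", "bed", "sit", "ting", "s", "c", "a", "t", "b", "e", "d"}

def encode_subword(word):
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # greedy longest match
            if word[i:j] in subword_vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])          # fall back to the raw character
            i += 1
    return pieces

print(encode_subword("catbed"))   # ['cat', 'bed']
print(encode_subword("sitting"))  # ['sit', 'ting']
print(encode_subword("cats"))     # ['cat', 's']
```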