Comment by cschep
1 year ago
How would we train it? Don't we need it to understand the heaps and heaps of data we already have "tokenized" e.g. the internet? Written words for humans? Genuinely curious how we could approach it differently?
Couldn't we just make every human readable character a token?
OpenAI's tokenizer turns "chess" into "ch" and "ess". We could just make it "c" "h" "e" "s" "s".
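(Not from the thread — a minimal sketch, assuming the tiktoken library is installed and using the cl100k_base encoding as an example, of how to inspect a BPE split versus a plain character split; the exact pieces depend on which encoding you load.)

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # assumption: example encoding name
    ids = enc.encode("chess")

    # Show each token id alongside the characters it covers.
    for token_id in ids:
        piece = enc.decode_single_token_bytes(token_id).decode("utf-8")
        print(token_id, repr(piece))

    # The character-level alternative is just the string split into characters.
    print(list("chess"))  # ['c', 'h', 'e', 's', 's']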
We can; tokenization is there mainly to maximize resources and provide as much "space" as possible in the context window.
There is no inherent advantage to tokenization itself; it just helps work around limitations in context windows and training.
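(A rough illustration of that point, again assuming tiktoken: the same sentence costs far fewer BPE tokens than characters or bytes, which is exactly the context-window saving described above.)

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    sentence = "Tokenization packs more text into a fixed context window."

    print("characters:", len(sentence))                   # one token per character
    print("utf-8 bytes:", len(sentence.encode("utf-8")))  # one token per byte
    print("BPE tokens:", len(enc.encode(sentence)))       # grouped subword tokens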
I like this explanation
This is just more tokens? And it probably requires the model to learn the common groupings itself. Consider: "ess" makes sense to see as a group; "wss" does not.
That is, the groups are encoding something the model doesn't have to learn.
This is not far from the "sight words" we teach kids.
No, actually far fewer distinct tokens: 256 tokens cover all bytes. See the ByT5 paper: https://arxiv.org/abs/2105.13626
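(A minimal sketch of the byte-level idea: every UTF-8 byte is its own token, so a 256-entry vocabulary covers any text. ByT5 itself reserves a few extra ids for special tokens, so its actual ids are offset slightly.)

    text = "chess"
    byte_ids = list(text.encode("utf-8"))   # one token id per byte, all ids < 256
    print(byte_ids)                         # [99, 104, 101, 115, 115]

    # Decoding is just the inverse mapping from bytes back to text.
    print(bytes(byte_ids).decode("utf-8"))  # chess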
This is just more tokens?
Yup. Just let the actual ML git gud
aka character language models, which have existed for a while now.
That's not what "tokenized" means here. The parent is asking to give the model separate characters rather than tokens, i.e. groups of characters.