Comment by taeric
1 year ago
This is just more tokens? And it probably requires the model to learn the common groups itself. Consider: "ess" makes sense to see as a group; "wss" does not.
That is, the groups are encoding something the model doesn't have to learn.
This is not far from the "sight words" we teach kids.
No, actually a much smaller vocabulary: 256 tokens cover all possible bytes. See the ByT5 paper: https://arxiv.org/abs/2105.13626
More tokens per sequence, though. And since it is learning sequences...
Yeah, suddenly a 16k-token context is just 16 KB of ASCII instead of ~6k words.
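A quick back-of-the-envelope sketch of that gap (the ~4 characters per subword token is a common rule of thumb, not a figure from this thread):

    # Byte-level vs. subword token counts (illustrative sketch only)
    text = "The quick brown fox jumps over the lazy dog."

    byte_tokens = list(text.encode("utf-8"))  # ByT5-style: one token per byte, vocab of 256
    approx_subword_tokens = len(text) / 4     # assumed rule of thumb: ~4 chars per BPE token

    print(len(byte_tokens))          # 44 byte-level tokens
    print(approx_subword_tokens)     # ~11 subword tokens for the same text

Same text, roughly 4x the sequence length once every byte is its own token.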
> This is just more tokens?
Yup. Just let the actual ML git gud
So, put differently, this is just more expensive?
Expensive computationally, expensive in time, and yes, expensive in dollar cost.
Worth noting that compute probably grows quadratically, cubically, or by some other polynomial in sequence length. So the difference in computational difficulty is probably huge when you go to one character (or byte) per token instead of subword tokens.
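Rough arithmetic under the assumption that self-attention cost is quadratic in sequence length and that byte-level tokenization yields ~4x more tokens (an assumed ratio, not a measurement from the thread):

    # Sketch of how a longer sequence inflates attention compute
    subword_len = 4_000           # hypothetical sequence length in subword tokens
    byte_len = subword_len * 4    # same text as byte-level tokens (assumed 4x ratio)

    attention_cost_ratio = (byte_len / subword_len) ** 2
    print(attention_cost_ratio)   # 16.0 -- roughly 16x the attention compute

So even a modest increase in tokens per character translates into a much larger increase in compute.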