Comment by Lerc
4 hours ago
The comment beside the first chart
>Our main measure of progress. Bits per byte is, per Karpathy, "a much better measure than just the typical cross-entropy loss, because it further normalizes the loss on each token by the number of bytes of that token, making the metric tokenizer-invariant".
is so blindingly obvious that I'm ashamed I didn't think to do it when trialing my own tokenizer approach on tinystories. I might go back and have a look at how well my tokenizer actually compared versus how well I imagined it compared.
ELI5 for anyone else (I had to have this explained to me):
When you train a language model, it tries to predict the next token.
We measure how good it is at that using loss, aka how surprised it was by the real answer.
Different models might use different tokenizers, so a token can cover different amounts of text. If you report loss per token, you can't easily compare the performance of two models whose tokens are different lengths.
So, compare loss to bytes of text data instead.
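To make that concrete, here's a rough sketch of how bits per byte can be computed from per-token cross-entropy loss (the function, variable names, and the example numbers are mine for illustration, not from the article):

```python
import math

def bits_per_byte(total_ce_loss_nats: float, total_utf8_bytes: int) -> float:
    """Convert summed cross-entropy loss (in nats, over all tokens)
    into bits per byte of the underlying UTF-8 text.

    Dividing by the byte count instead of the token count removes the
    tokenizer from the metric, so models with different vocabularies
    can be compared directly."""
    total_bits = total_ce_loss_nats / math.log(2)  # nats -> bits
    return total_bits / total_utf8_bytes

# Hypothetical numbers: 1,000 tokens with a mean loss of 2.3 nats/token,
# covering 4,200 bytes of text.
print(bits_per_byte(1000 * 2.3, 4200))  # ~0.79 bits per byte
```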
Why hasn't anyone made a tokenizer that's 1 character per token? Is it because it requires an insane amount of compute?
Or would the loss of efficiency make it dumber than modern tokenizers?
Tokenizers used to be 1 character per token. Then Google adopted subword encoding[1] in their early neural machine translation work and found it was much better.
Subword units are genuinely meaningful in most languages. You do need to tune the vocabulary size though.
[1] https://aclanthology.org/P16-1162/
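For anyone curious what "subword units" look like in practice, here is a toy sketch of BPE-style merging in the spirit of [1]. The word counts and the three-merge loop are made up for illustration; the real algorithm also handles end-of-word markers and learns tens of thousands of merges:

```python
from collections import Counter

# Toy corpus: words as sequences of symbols (initially single characters),
# with made-up frequencies.
vocab = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
         ("n", "e", "w", "e", "s", "t"): 6, ("w", "i", "d", "e", "s", "t"): 3}

def most_frequent_pair(vocab):
    # Count how often each adjacent symbol pair occurs, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(pair, vocab):
    # Replace every occurrence of the chosen pair with a single merged symbol.
    merged = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# A few merge steps turn frequent character pairs into subword units
# like "es", "est", "lo" -- pieces that recur across different words.
for _ in range(3):
    pair = most_frequent_pair(vocab)
    vocab = merge_pair(pair, vocab)
    print(pair, "->", list(vocab))
```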
Since the OpenAI tokenizer is estimated at ~4.2 characters per token, with your proposed "1 char per token" tokenizer the effective context length immediately becomes 4.2 times smaller, and generation becomes 4.2 times slower (since 4.2 times more tokens are needed for the same output). Doesn't look like a good tradeoff.
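Back-of-the-envelope version of that, using the ~4.2 chars/token estimate above (the 8,192-token context window is just an illustrative number):

```python
chars_per_token_subword = 4.2    # rough estimate for the OpenAI tokenizer
chars_per_token_char_level = 1.0

# Text that fits in a hypothetical 8,192-token context today:
text_chars = 8192 * chars_per_token_subword            # ~34,400 characters

# The same text as character-level tokens:
tokens_char_level = text_chars / chars_per_token_char_level

print(tokens_char_level / 8192)  # 4.2 -> 4.2x less text per context window,
                                 # and ~4.2x more decode steps for the same output
```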
yes to both.
absolutely requires longer training time and more compute.
once trained, predictions need to hold through many more steps because each step processes one token. if a token early in a sentence heavily implies that another token will occur later in the sentence, then that awareness needs to be maintained while processing each intermediary token, and each step is a bit lossy. the fewer steps you need to take before leveraging that knowledge, the better the prediction.
if you had infinite compute and data for training then performance would be equivalent though, i think.