Comment by SeanAnderson
7 hours ago
ELI5 for anyone else (I had to have this explained to me):
When you train a language model, it tries to predict the next token.
We measure how good it is at that using loss, i.e. how surprised it was by the real answer.
Different models might use different tokenizers, so the same text gets split into a different number of tokens, of different lengths. If you report loss per token, you can't easily compare two models whose tokenizers differ.
So, report loss per byte of text instead. The byte count of a piece of text is the same no matter how it gets tokenized.
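Here's a rough sketch of the conversion as I understand it, assuming the loss is cross-entropy in nats summed over all tokens of the text. The function name and the two "models" with their numbers are made up for illustration, not taken from any real run.

```python
import math

def bits_per_byte(total_loss_nats: float, text: str) -> float:
    """Convert summed cross-entropy (nats) over a text to bits per UTF-8 byte."""
    n_bytes = len(text.encode("utf-8"))
    # Dividing by ln(2) converts nats to bits; dividing by n_bytes normalizes by text size.
    return total_loss_nats / (math.log(2) * n_bytes)

text = "hello world"  # 11 bytes in UTF-8

# Hypothetical model A: splits the text into 2 tokens, mean loss 3.0 nats/token.
# Hypothetical model B: splits the text into 4 tokens, mean loss 1.5 nats/token.
bpb_a = bits_per_byte(2 * 3.0, text)
bpb_b = bits_per_byte(4 * 1.5, text)

print(f"A: 3.0 nats/token, {bpb_a:.4f} bits/byte")
print(f"B: 1.5 nats/token, {bpb_b:.4f} bits/byte")
```

Per token, B looks twice as good as A. Per byte they come out identical, because both spent the same total surprise on the same text.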