Comment by hnfong
1 year ago
I just realized who I was replying to :)
While we're on this topic, wonder whether you have comments about this --
Given that a sentence has a lot of redundant data (grammatical constructs, etc.), saying a model has 99% accuracy might not mean much if it diverges on the "critical" tokens -- for example the keyword in a paragraph, or the relatively surprising twist in an article.
That's kind of how I interpret "to me it seems that a low perplexity just means that the model is able to produce coherent, readable sentences" (from the LocalLlama post). A model that can write English can have a low perplexity since it's averaged out, but if it can't recall the "correct" token at the critical point, it will still underperform despite the low perplexity.
Intuitively this might depend on whether "intelligence" depends on the precision in the bits. It's super hard to measure, which is why even subjective anecdotes or bare assertions like the ones in the post are still interesting.
Hey! I agree -- if it can't recall the correct token at a "critical point", then even if perplexity is low, the sentence becomes unusable.
The main issue is that perplexity is just exp(CE_loss), so essentially minimizing cross entropy loss is the same as minimizing perplexity. And CE is just the average of -log P(correct next token).
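To make the averaging point concrete, here's a tiny sketch (plain Python, made-up probabilities, not from any real model) showing that since perplexity is exp of the mean per-token cross entropy, a single badly-predicted "critical" token barely moves it once it's averaged over a longer sentence:

```python
import math

# Hypothetical per-token probabilities assigned to the correct next token:
# 19 "easy" tokens predicted confidently, plus 1 "critical" token predicted badly.
easy = [0.95] * 19
critical = [0.01]

def perplexity(probs):
    # Cross entropy per token is -log p(correct token); perplexity = exp(mean CE).
    ce = [-math.log(p) for p in probs]
    return math.exp(sum(ce) / len(ce))

print(perplexity(easy))             # ~1.05 -- near-perfect
print(perplexity(easy + critical))  # ~1.32 -- still "low", despite the critical miss
```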
We'd need some new loss function which, say, also minimizes the loss on the 2nd or 3rd token ahead, which could be more effective - sadly it's more computationally expensive, and in the long run it might end up being equivalent to just minimizing CE anyways.
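Purely as a sketch of what I mean (the extra prediction heads, shapes, and weights below are made up for illustration, not an actual recipe), a weighted CE over the next few tokens could look something like:

```python
import torch
import torch.nn.functional as F

def multi_step_ce(logits_per_offset, targets, weights=(1.0, 0.5, 0.25)):
    # logits_per_offset[k] has shape (T - k, vocab): head k at position t
    # tries to predict targets[t + k], i.e. the token k+1 steps ahead.
    # Standard next-token CE is the k = 0 term; later terms are down-weighted.
    loss = 0.0
    for k, (w, logits) in enumerate(zip(weights, logits_per_offset)):
        loss = loss + w * F.cross_entropy(logits, targets[k:])
    return loss

# Toy usage with random tensors: T = 16 positions, vocab of 100.
T, V = 16, 100
targets = torch.randint(0, V, (T,))
logits_per_offset = [torch.randn(T - k, V) for k in range(3)]
print(multi_step_ce(logits_per_offset, targets))
```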
Yeah, intelligence sadly is still hard to define.