Comment by danielhanchen

1 year ago

Hey! I agree — if the model can't recall the correct token at a "critical point", then even if perplexity is low, the sentence becomes unusable.

The main issue is that perplexity is just exp(CE_loss), so minimizing cross entropy loss is essentially the same as minimizing perplexity. And per-token CE is just -log P(next token).
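A tiny sketch of that relationship, using made-up probabilities for a toy vocabulary:

```python
import math

# Hypothetical next-token distribution over a 4-token vocabulary.
probs = [0.7, 0.2, 0.05, 0.05]
target = 0  # index of the true next token

# Per-token cross entropy is -log P(true next token)...
ce_loss = -math.log(probs[target])

# ...and perplexity is exp of that loss, so minimizing
# one is the same as minimizing the other.
perplexity = math.exp(ce_loss)

print(ce_loss, perplexity)  # perplexity == 1 / P(true token)
```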

We need some new loss function which, say, also minimizes the loss on the 2nd or 3rd token ahead — that could probably be more effective. Sadly it's more computationally expensive, and in the long run it might turn out to be equivalent to just minimizing CE.
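A rough sketch of what such a loss could look like — averaging per-token CE over the next k tokens instead of just the next one. The function name and the toy distributions are hypothetical, not from any actual implementation:

```python
import math

def multi_token_ce(step_probs, targets):
    """Average cross entropy over the next k tokens (hypothetical loss).

    step_probs: list of k probability distributions, one per future position.
    targets:    list of k true token indices for those positions.
    """
    return sum(-math.log(p[t]) for p, t in zip(step_probs, targets)) / len(targets)

# Toy example: predict the next 2 tokens over a 2-token vocabulary.
loss = multi_token_ce([[0.7, 0.3], [0.4, 0.6]], [0, 1])
print(loss)
```

The extra cost is visible even here: each training position now needs k forward predictions instead of one.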

Ye, intelligence sadly is still hard to define