
Comment by hnfong

1 year ago

On a related point, there was a post on r/LocalLLaMA a short while ago claiming that quantization impacts performance more than people think:

https://www.reddit.com/r/LocalLLaMA/comments/1b5uv86/perplex...

The argument is that while perplexity is used as evidence that quantized models perform almost as well as the original float weights, perplexity mostly measures whether the output looks plausible; it doesn't capture performance (roughly equivalent to "intelligence") when more nuance is required.

I haven't been able to observe this myself, perhaps because I haven't been playing with language models enough (or haven't stretched their abilities to their limits), but from a theoretical perspective what they say makes a lot of sense. Even at the inference stage, the fine details of the inference software's implementation and its parameters could make a big difference to the performance of the models.

So I'd be very skeptical of attempts to evaluate the performance (i.e. intelligence level) of models on anything other than the stack (preferably down to the hardware) suggested by the party that released the model.

Oh, I actually missed this one! I normally follow LocalLlama a lot, but recently forgot to check it!

In terms of quantization losing accuracy - this actually does happen. The perplexity seems fine because perplexity is generally calculated from a single forward pass of the model, i.e. not via generation, so it essentially measures the accuracy of the first predicted token only. Imagine you have 99% per-token accuracy and a 1% error due to quantization: over 100 generated tokens, the chance of an error-free sequence is 0.99^100 = 36.6%, for example. So over long contexts, quantization definitely causes problems.
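
(A quick back-of-the-envelope sketch of that compounding effect, assuming independent per-token errors and a hypothetical 99% per-token agreement rate:)

    # Illustrative only: if a quantized model matches the full-precision
    # model's token choice with probability p at each step, the chance of
    # an identical N-token generation decays geometrically.
    p = 0.99  # assumed per-token agreement rate (hypothetical)
    for n in (10, 100, 1000):
        print(f"{n:>5} tokens: {p**n:.1%} chance of an error-free sequence")
    # 100 tokens -> ~36.6%, matching the figure above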

Creating quantization-aware approaches where long contexts aren't affected sadly becomes a computational challenge. In terms of Unsloth specifically, finetuning is 2x faster on both 16-bit and quantized models :)

  • I just realized who I was replying to :)

    While we're on this topic, I wonder whether you have comments on this --

    Given that a sentence contains a lot of redundancy (grammatical constructs, etc.), saying a model has 99% accuracy might not mean much if it diverges on the "critical" tokens -- for example the keyword in a paragraph, or the relatively surprising twist in an article.

    That's kind of how I interpret "to me it seems that a low perplexity just means that the model is able to produce coherent, readable sentences" (from the LocalLlama post). A model that can write English can have a low perplexity since the metric is averaged out, but if it can't recall the "correct" token at the critical point, it will still underperform despite the low perplexity.

    Intuitively this might depend on whether "intelligence" depends on the precision in the bits. It's super hard to measure, which is why even subjective anecdotes or bare assertions like the ones in the post are still interesting.

    • Hey! I agree that if it can't recall the correct token at a "critical point", then even if perplexity is low, the sentence becomes unusable.

      The main issue is that perplexity is just exp(CE_loss), so minimizing cross-entropy loss is essentially the same as minimizing perplexity. And CE is just -log P(of the next token) - see the minimal sketch at the end of this comment.

      We'd need some new loss function which also minimizes, say, the loss of the 2nd or 3rd token ahead, which could be more effective - sadly it's more computationally expensive, and in the long run it might just be equivalent to minimizing CE.

      Yeah, intelligence sadly is still hard to define.
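
      (A minimal sketch of the perplexity/cross-entropy relationship mentioned above, using hypothetical per-token probabilities rather than anything measured:)

        import math

        # Sketch only: perplexity is exp of the mean per-token cross-entropy,
        # so minimizing CE and minimizing perplexity are the same objective.
        # Hypothetical probabilities the model assigned to the true tokens:
        token_probs = [0.9, 0.8, 0.95, 0.6]

        ce_loss = -sum(math.log(p) for p in token_probs) / len(token_probs)
        perplexity = math.exp(ce_loss)

        print(f"mean CE loss: {ce_loss:.4f}")
        print(f"perplexity:   {perplexity:.4f}")  # == exp(mean CE loss)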