Comment by danielhanchen
1 year ago
Oh yes, that's a fair point on precision! In fact the majority of issues for Gemma (other than the approximate vs exact GELU issue) are precision based - i.e. it's fine in float32, but loses a lot of accuracy in the bfloat16 or float16 domain!
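A minimal sketch of how you can separate the two effects (the formula vs the precision), assuming PyTorch >= 1.12 where F.gelu takes an approximate='tanh' argument; numbers here are just whatever your machine prints:

  import torch
  import torch.nn.functional as F

  x = torch.linspace(-6, 6, 10_000)  # float32 by default

  exact_fp32 = F.gelu(x)                                        # exact GELU in float32 (reference)
  approx_fp32 = F.gelu(x, approximate='tanh')                   # tanh-approximate GELU in float32
  exact_bf16 = F.gelu(x.bfloat16()).float()                     # exact GELU computed in bfloat16
  approx_bf16 = F.gelu(x.bfloat16(), approximate='tanh').float()

  # How far each variant drifts from the float32 exact reference:
  print("approx vs exact (fp32):   ", (approx_fp32 - exact_fp32).abs().max().item())
  print("exact in bf16 vs fp32:    ", (exact_bf16 - exact_fp32).abs().max().item())
  print("approx in bf16 vs fp32:   ", (approx_bf16 - exact_fp32).abs().max().item())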
On a related point, there was a post on r/LocalLLaMA a short while ago claiming that quantization impacts perf more than people think:
https://www.reddit.com/r/LocalLLaMA/comments/1b5uv86/perplex...
The argument is that while perplexity is used as evidence that quantized models perform almost as well as the original float weights, perplexity mostly measures whether the output looks correct; it doesn't measure performance (roughly equivalent to "intelligence") when you need more nuance.
I haven't been able to observe this myself - perhaps I haven't been playing with language models enough (or haven't tried to stretch their abilities to their limits) - but from a theoretical perspective what they say makes a lot of sense. Even at the inference stage, the fine details of the inference software's implementation and parameters could make a big difference to a model's performance.
So I'd be very skeptical of people trying to evaluate the performance (i.e. intelligence level) of models with anything other than the stack (preferably down to the hardware) suggested by the party that released the model.
Oh, I actually missed this! I normally follow LocalLlama a lot, but just recently forgot to!
In terms of quantization losing accuracy - this actually does happen - the perplexity just seems fine, since perplexity is generally calculated from the forward pass of a model, i.e. not via generation. This means perplexity only reflects how well the model predicts the next token when it's fed the correct context, not how errors compound during generation. Imagine you have 99% accuracy and a 1% error due to quantization. Over 100 generated tokens, the chance of staying on track is roughly 0.99^100 = 36.6%. So over long contexts, quantization definitely can cause problems.
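A quick back-of-the-envelope sketch of that compounding, with p standing in for the per-token "stay correct" probability (illustrative numbers only):

  # If each generated token is "right" with probability p, the chance the
  # whole continuation stays on track decays roughly as p**n.
  for p in (0.999, 0.99, 0.95):
      for n in (10, 100, 1000):
          print(f"p={p}, n={n}: {p**n:.1%}")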
Creating quantization-aware approaches where long contexts don't get affected sadly becomes a computational challenge. In terms of Unsloth specifically, finetuning is 2x faster on both 16-bit and quantized models :)
I just realized who I was replying to :)
While we're on this topic, I wonder whether you have comments about this --
Given that a sentence has a lot of redundant data (grammatical constructs, etc.), saying a model has 99% accuracy might not mean much if it diverges on the "critical" tokens -- for example the keyword in a paragraph, or the relatively surprising twist in an article.
That's kind of how I interpret "to me it seems that a low perplexity just means that the model is able to produce coherent, readable sentences" (from the LocalLlama post). A model that can write English can have a low perplexity since the metric is averaged out, but if it can't recall the "correct" token at the critical point, it will still underperform despite the low perplexity.
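A toy sketch of why the averaging hides this, with made-up probabilities just to show that perplexity = exp of the mean negative log-likelihood over tokens:

  import math

  easy = [0.98] * 99                       # near-certain on 99 "filler" tokens
  for p_critical in (0.9, 0.1, 0.01):      # probability assigned to the one token that matters
      probs = easy + [p_critical]
      nll = -sum(math.log(p) for p in probs) / len(probs)
      print(f"critical-token prob {p_critical}: perplexity = {math.exp(nll):.3f}")

One badly-predicted critical token barely moves the average when the other 99 tokens are easy.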
Intuitively this might depend on whether "intelligence" depends on the precision in the bits. It's super hard to measure, which is why even subjective anecdotes or bare assertions like the ones in the post are still interesting.
Gemma.cpp was also affected, and now fixed (https://github.com/google/gemma.cpp/pull/93). Thanks for the heads-up!
Oh great! :)