Comment by Me1000
1 year ago
Something I've noticed with open-weight models is the rush to judgment as soon as they are released. But most people aren't actually running these models in full fp16 with the code supplied; they're using quantized versions with tip-of-tree patches to libraries like llama.cpp to get them running. And posts like this just show that it takes a bit for the software side of a model to get all the kinks worked out. We saw this with Mixtral (new architecture), CodeLlama-70b (new, very strict prompt format), and now Gemma.
In some ways it makes me so excited realizing how early this technology still is! There's going to be so much innovation and so many cool things built over the next several years, and so much new stuff to learn!
Oh yes, that's a fair point on precision! In fact the majority of issues for Gemma (other than the approximate vs exact GELU issue) are precision-based - i.e. it's fine in float32, but loses a lot of accuracy in the bfloat16 or float16 domain!
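A minimal sketch of both effects (assuming PyTorch; purely illustrative, not Gemma's or Unsloth's actual code) - comparing exact vs tanh-approximate GELU, and fp32 vs the lower-precision dtypes:

    import torch
    import torch.nn.functional as F

    x = torch.randn(4096) * 10  # hypothetical activations in float32

    exact = F.gelu(x, approximate="none")   # exact (erf-based) GELU
    approx = F.gelu(x, approximate="tanh")  # tanh approximation
    print("exact vs approx (fp32):", (exact - approx).abs().max().item())

    # Precision loss: run the same op in bfloat16 / float16 and compare to fp32.
    for dtype in (torch.bfloat16, torch.float16):
        y = F.gelu(x.to(dtype), approximate="none").float()
        print(f"fp32 vs {dtype}:", (exact - y).abs().max().item())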
On a related point, there was a post on r/LocalLLaMA a short while ago claiming that quantization impacts performance more than people think:
https://www.reddit.com/r/LocalLLaMA/comments/1b5uv86/perplex...
The argument is that while perplexity is used as evidence that quantized models perform almost as well as the original float weights, perplexity tends to measure whether the output looks correct; it doesn't measure performance (roughly, "intelligence") when you need more nuance.
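(For reference, perplexity here is just the exponentiated average negative log-likelihood of the reference text, scored token by token with teacher forcing - a rough illustrative snippet, not any particular library's implementation:

    import math

    def perplexity(token_logprobs):
        # token_logprobs[i] = log P(token_i | the correct tokens before i)
        return math.exp(-sum(token_logprobs) / len(token_logprobs))

    # A model that assigns probability 0.25 to every reference token:
    print(perplexity([math.log(0.25)] * 100))  # -> 4.0

So it scores the given text; it never has to live with its own mistakes the way free-running generation does.)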
I haven't been able to observe this myself - perhaps I haven't been playing with language models enough (or haven't tried to stretch their abilities to their limits) - but from a theoretical perspective what they say makes a lot of sense. Even at the inference stage, the fine details of the inference software's implementation and parameters could make a big difference to the performance of the models.
So I'd be very skeptical of people trying to evaluate the performance (i.e. intelligence level) of models with anything other than the stack (preferably down to the hardware) suggested by the party that released the model.
Oh, this one actually slipped past me! I normally follow LocalLlama a lot, but just recently forgot to check!
In terms of quantization losing accuracy - this actually does happen. The perplexity seems fine because perplexity is generally calculated from the forward pass of a model, i.e. not via generation; it effectively measures the accuracy of each single next token given the correct prefix, so errors don't compound. Imagine you have 99% accuracy and a 1% error per token due to quantization. Over 100 generated tokens, the accuracy rate is 0.99^100 = 36.6%, for example. So over long contexts, quantization can definitely cause problems.
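A back-of-the-envelope version of that compounding argument (illustrative only - it assumes each token's error is independent, which real generation isn't):

    per_token_accuracy = 0.99  # hypothetical 1% extra error per token from quantization

    for n_tokens in (10, 100, 1000):
        print(n_tokens, per_token_accuracy ** n_tokens)
    # 10   -> ~0.904
    # 100  -> ~0.366
    # 1000 -> ~0.000043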
Creating quantization-aware approaches where long contexts don't get affected becomes a computational challenge, sadly. In terms of Unsloth specifically, finetuning is 2x faster on both 16-bit and quantized models :)
Gemma.cpp was also affected, and now fixed (https://github.com/google/gemma.cpp/pull/93). Thanks for the heads-up!
Oh great! :)