Comment by lostmsu
10 days ago
> without completely lobotomizing it
The question in case of quants is: will they lobotomize it beyond the point where it would be better to switch to a smaller model like GPT-OSS 120B that comes prequantized to ~60GB.
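The ~60GB figure is straightforward arithmetic (a rough sketch; the real on-disk size also includes quant-format overhead like scales and the unquantized layers):

```python
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough model memory footprint: parameters x bits / 8, in GB.
    Ignores quantization metadata (scales, zero-points) and KV cache."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# GPT-OSS 120B at ~4 bits per weight:
print(model_size_gb(120, 4))  # → 60.0
```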
In general, quantizing down to 6 bits gives no measurable loss in performance. Down to 4 bits gives a small measurable loss. It starts dropping faster at 3 bits, and at 1 bit it can fall below the performance of the next smaller model in the family (where families tend to step model sizes by factors of about 4 in number of parameters).
So in the same family, you can generally quantize all the way down to 2 bits before you want to drop down to the next smaller model size.
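The 2-bit crossover follows from that factor-of-4 step: a model at 2 bits occupies the same memory as its 4x-smaller sibling at 8 bits, so below 2 bits you may as well run the smaller model at a higher-fidelity quant. A back-of-the-envelope sketch (the 70B figure is just an illustrative size, not from the thread):

```python
def footprint_gb(params_billion: float, bits: float) -> float:
    """Approximate weight-memory footprint in GB."""
    return params_billion * 1e9 * bits / 8 / 1e9

# A 70B model at 2 bits vs. its ~4x-smaller sibling at 8 bits:
print(footprint_gb(70, 2))      # → 17.5
print(footprint_gb(70 / 4, 8))  # → 17.5  (same memory budget)
```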
Between families, there will obviously be more variation. You really need evals specific to your use case if you want to compare them: different model families can perform quite differently on different types of problems, and because vendors optimize for benchmarks, it's really helpful to have your own evals to test with.
Did you run, say, SWE-Bench Verified? Where is this claim coming from? It's just an urban legend.
> In general, quantizing down to 6 bits gives no measurable loss in performance.
...this can't be literally true or no one (including e.g. OpenAI) would use > 6 bits, right?
NVIDIA is showing training at 4 bits (NVFP4), and 4-bit quants have been standard for running LLMs at home for quite a while because the performance was good enough.
I mean, GPT-OSS is delivered as a 4-bit model, and apparently they even trained it at 4 bits. Many models are trained at 16 bits because it provides better stability for gradient descent, but there are methods that allow efficient training at smaller bit widths too.
There was a paper I'd been looking at that demonstrated what I mentioned: only imperceptible changes down to 6-bit quants, then performance decreasing more and more rapidly until it crossed below the next smaller model at 1 bit. Unfortunately, I can't seem to find it again.
There's this article from Unsloth showing MMLU scores for quantized Llama 4 models. The quants are of an 8-bit base model, so it's not quite the same as comparing against 16 bits, but you see no reduction in score at 6 bits, while it starts falling below that. https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs/uns...
Anyhow, like anything in machine learning, if you want to be certain, you need to run your own evals. But when researching, I found enough evidence that you lose very little performance down to 6-bit quants, and that even at much smaller quants the number of parameters tends to matter more than the quantization, all the way down to 2 bits, that it works as a good rule of thumb. I'll generally grab a 6- or 8-bit quant to save on RAM without thinking much about it, and I'll try models down to 2 bits if I need to in order to fit them on my system.
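That "grab the biggest quant that fits" habit can be sketched as a small helper (hypothetical, not any real tool's API; uses the same rough params-times-bits estimate and ignores KV-cache headroom):

```python
def pick_quant(params_billion: float, ram_gb: float,
               quants=(8, 6, 5, 4, 3, 2)):
    """Return the largest bits-per-weight quant whose approximate
    footprint fits in ram_gb, or None if even 2 bits won't fit."""
    for bits in quants:
        if params_billion * bits / 8 <= ram_gb:
            return bits
    return None

# A 120B model on a 64GB machine: 8/6/5-bit quants don't fit,
# but 4 bits (~60GB) does.
print(pick_quant(120, 64))  # → 4
```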