Comment by 2001zhaozhao
2 days ago
Isn't between Q4-Q6 the usual recommendation for quants? Can you explain the Q8 recommendation, as I was under the impression that if you can run a model at Q8, you should probably run a bigger model in Q4 instead
There are no hard rules regarding quants, except that less quantization is better.
However, models respond very differently, and there are tricks you can do like limiting quantization of certain layers. Some models generally behave fine down into sub-Q4 territory, while others don't do well below Q8 at all. And then you have the way it was quantized on top of that.
So either find some actual benchmarks, which can be rare, or you just have to try.
As an example, Unsloth recently released some benchmarks[1] which showed Qwen3.5 35B tolerating quantization very well, except for a few layers which were very sensitive.
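To illustrate why a few layers can be so much more sensitive than the rest, here is a minimal, self-contained sketch (not how real GGUF quantizers work internally, just the basic mechanism): a layer whose weights include a handful of large outliers stretches the quantization scale, so at low bit widths the bulk of its small weights get crushed toward zero.

```python
import random

def quantize(weights, bits):
    """Uniform symmetric round-to-nearest quantization, then dequantize."""
    levels = 2 ** (bits - 1) - 1          # e.g. 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) * scale for w in weights]

def rms_error(a, b):
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)) ** 0.5

random.seed(0)
# A "well-behaved" layer: weights clustered near zero.
smooth = [random.gauss(0, 0.02) for _ in range(4096)]
# A "sensitive" layer: same bulk of small weights, plus a few large
# outliers that stretch the quantization scale for the whole tensor.
spiky = smooth[:-4] + [2.0, -2.0, 1.5, -1.5]

for name, layer in [("smooth", smooth), ("spiky", spiky)]:
    for bits in (8, 4):
        err = rms_error(layer, quantize(layer, bits))
        print(f"{name} layer @ {bits}-bit: RMS error {err:.5f}")
```

At 4-bit the outlier-heavy layer's error blows up relative to the well-behaved one, which is why schemes like Unsloth's keep a few sensitive tensors at higher precision while quantizing the rest aggressively.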
edit: Unsloth has a page detailing their updated quantization method here[2], which was just submitted[3].
[1]: https://news.ycombinator.com/item?id=47192505
if you can run Q8, go for it; always go for the best. it matters a lot with vision models. never quantize your kv cache; keep it at f16.
you can always try evals and see if you have a q6 or q4 that can perform better than your q8. for smaller models I go q8. for bigger ones, when I run out of memory, I go q6/q5/q4 and sometimes q3. I run deepseek/kimi at q4, for example.
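A quick way to run that kind of eval yourself is llama.cpp's perplexity tool; lower perplexity on held-out text is generally better. This is a rough sketch, not a definitive recipe: the GGUF file names and the eval text file below are hypothetical placeholders, and the `-ctk`/`-ctv` flags pin the KV cache types to f16 as advised above.

```shell
#!/bin/sh
# Compare perplexity of two quants of the same model (file names are
# placeholders; substitute your own GGUF files and a held-out text file).
for q in Q8_0 Q4_K_M; do
  echo "== $q =="
  ./llama-perplexity -m "model-${q}.gguf" -f wiki.test.raw -ctk f16 -ctv f16
done
```

If the Q4 perplexity is close to the Q8 number for your model and task, the smaller quant is probably fine; if it diverges noticeably, stay at the higher precision.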
I suggest beginners start with q8 so they get the best quality and aren't disappointed. it's simple to use q8 if you have the memory; choice fatigue and confusion come in once you start trying to pick other quants...