Comment by tasuki
6 days ago
How does one choose between "fewer parameters and less quantization" vs "more parameters and more quantization" ?
There were some benchmarks a few years ago from, IIRC, the people behind llama.cpp or Ollama (I forget which).
The basic rule of thumb is that more parameters are always better, with diminishing returns as you go down to 2-3 bits per parameter. This is purely about model quality, not inference speed.
It's just a matter of finding the sweet spot between answer accuracy, available VRAM, and tokens per second.
Ok, say I have 14GB VRAM. What is the tradeoff between using 9B with 8-bit params vs 27B with 3-bit params?
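To make the tradeoff concrete, here is a rough back-of-the-envelope estimate of how much VRAM each option needs (the 10% overhead factor for embeddings, norms, and higher-precision layers is my own assumption; real GGUF files vary by quant scheme):

```python
def model_size_gb(params_billion: float, bits_per_param: float,
                  overhead: float = 1.1) -> float:
    """Weights-only size estimate with ~10% overhead for embeddings,
    norms, and layers kept at higher precision."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total * overhead / 1e9

# The two options from the question, against a 14 GB budget:
for name, p, b in [("9B @ 8-bit", 9, 8), ("27B @ 3-bit", 27, 3)]:
    size = model_size_gb(p, b)
    print(f"{name}: ~{size:.1f} GB, leaving ~{14 - size:.1f} GB for KV cache")
```

Both land around 10-11 GB, so both fit in 14GB with some room left for context, and the choice comes down to quality vs speed.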
You COULD even run Qwen3.5-35B-A3B-GGUF.
The UD-IQ3-XXS quant is only 13.1GB, and might beat both in intelligence, and certainly in speed (only 3B parameters are activated): https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF
To make room for the KV cache you will need to offload a few feed-forward layers to the CPU. It will still be quite fast.
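In recent llama.cpp builds, this kind of selective offload can be done with the tensor-override flag; the exact regex, the block count, and the model filename below are assumptions for illustration, so check `--help` on your build:

```shell
# Offload all layers to the GPU by default (-ngl 99), but override the
# feed-forward tensors of the first 8 blocks to stay on the CPU,
# leaving VRAM free for the KV cache.
llama-server -m Qwen3.5-35B-A3B-UD-IQ3_XXS.gguf \
  -ngl 99 \
  -ot "blk\.[0-7]\.ffn.*=CPU"
```

With an MoE model like this, keeping the expert FFN weights on the CPU costs relatively little speed, since only a small fraction of them fire per token.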
Edit: actually, 27B does a little better than 35B on most benchmarks, though 35B will still be much faster.
3-bit 27B will almost certainly be better. 4 bits is usually the limit below which the quality drop-off gets much steeper, and you also get diminishing returns above 6 bits. So I'd still rather pack in more parameters at 3 bits.
9B will be faster, however.
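The whole rule of thumb can be sketched as a tiny selection function (the 1.5GB KV-cache reserve, the 3-bit floor, and the 10% size overhead are my own assumed numbers, not from any benchmark):

```python
def size_gb(params_b: float, bits: float, overhead: float = 1.1) -> float:
    # Weights-only estimate with ~10% overhead (embeddings, norms, etc.)
    return params_b * 1e9 * bits / 8 * overhead / 1e9

def pick(candidates, vram_gb, reserve_gb=1.5, min_bits=3.0):
    """Among quants that fit in VRAM (leaving room for the KV cache)
    and stay at or above ~3 bits per weight, prefer the one with the
    most parameters."""
    ok = [c for c in candidates
          if c[1] >= min_bits and size_gb(c[0], c[1]) <= vram_gb - reserve_gb]
    return max(ok, key=lambda c: c[0], default=None)

# Candidates as (params in billions, bits per weight):
best = pick([(9, 8), (27, 3), (27, 4)], vram_gb=14)
print(best)  # 27B @ 4-bit doesn't fit, so 27B @ 3-bit wins over 9B @ 8-bit
```

This encodes "more parameters first, as long as you stay at or above roughly 3 bits and the whole thing still fits"; speed is a separate axis it deliberately ignores.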
See also: https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks