Comment by tasuki
6 days ago
How does one choose between "fewer parameters and less quantization" vs "more parameters and more quantization" ?
There were some benchmarks a few years ago from, IIRC, the people behind llama.cpp or Ollama (I forget which).
The basic rule of thumb is that more parameters are always better, with diminishing returns as you go down to 2-3 bits per parameter. This is purely about model quality, not inference speed.
It's just a matter of finding the sweet spot between answer accuracy, available VRAM, and tokens per second.
Ok, say I have 14GB VRAM. What is the tradeoff between using 9B with 8-bit params vs 27B with 3-bit params?
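To make the tradeoff concrete, here is a rough back-of-the-envelope estimate of how much VRAM each option needs (the 10% overhead factor for embeddings, norms, and higher-precision layers is my own assumption; real GGUF files vary by quant scheme):

```python
def model_size_gb(params_billion: float, bits_per_param: float,
                  overhead: float = 1.1) -> float:
    """Weights-only size estimate with ~10% overhead for embeddings,
    norms, and layers kept at higher precision."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total * overhead / 1e9

# The two options from the question, against a 14 GB budget:
for name, p, b in [("9B @ 8-bit", 9, 8), ("27B @ 3-bit", 27, 3)]:
    size = model_size_gb(p, b)
    print(f"{name}: ~{size:.1f} GB, leaving ~{14 - size:.1f} GB for KV cache")
```

Both land around 10-11 GB, so both fit in 14GB with some room left for context, and the choice comes down to quality vs speed.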
You COULD even run Qwen3.5-35B-A3B-GGUF.
The UD-IQ3-XXS quant is only 13.1GB, and might beat both in intelligence, and certainly in speed (only 3B parameters are activated): https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF
To make room for the KV cache you will need to offload a few feed-forward layers to the CPU. It will still be quite fast.
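In recent llama.cpp builds, this kind of selective offload can be done with the tensor-override flag; the exact regex, the block count, and the model filename below are assumptions for illustration, so check `--help` on your build:

```shell
# Offload all layers to the GPU by default (-ngl 99), but override the
# feed-forward tensors of the first 8 blocks to stay on the CPU,
# leaving VRAM free for the KV cache.
llama-server -m Qwen3.5-35B-A3B-UD-IQ3_XXS.gguf \
  -ngl 99 \
  -ot "blk\.[0-7]\.ffn.*=CPU"
```

With an MoE model like this, keeping the expert FFN weights on the CPU costs relatively little speed, since only a small fraction of them fire per token.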
Edit: actually, 27B does a little better than 35B on most benchmarks, though 35B will still be much faster.
3-bit 27B will almost certainly be better. 4 bits is usually the limit below which the quality drop-off gets much steeper, and you also get diminishing returns above 6 bits. So I'd still rather pack in more parameters at 3 bits.
9B will be faster, however.
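The whole rule of thumb can be sketched as a tiny selection function (the 1.5GB KV-cache reserve, the 3-bit floor, and the 10% size overhead are my own assumed numbers, not from any benchmark):

```python
def size_gb(params_b: float, bits: float, overhead: float = 1.1) -> float:
    # Weights-only estimate with ~10% overhead (embeddings, norms, etc.)
    return params_b * 1e9 * bits / 8 * overhead / 1e9

def pick(candidates, vram_gb, reserve_gb=1.5, min_bits=3.0):
    """Among quants that fit in VRAM (leaving room for the KV cache)
    and stay at or above ~3 bits per weight, prefer the one with the
    most parameters."""
    ok = [c for c in candidates
          if c[1] >= min_bits and size_gb(c[0], c[1]) <= vram_gb - reserve_gb]
    return max(ok, key=lambda c: c[0], default=None)

# Candidates as (params in billions, bits per weight):
best = pick([(9, 8), (27, 3), (27, 4)], vram_gb=14)
print(best)  # 27B @ 4-bit doesn't fit, so 27B @ 3-bit wins over 9B @ 8-bit
```

This encodes "more parameters first, as long as you stay at or above roughly 3 bits and the whole thing still fits"; speed is a separate axis it deliberately ignores.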
See also: https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks