Comment by tasuki

6 days ago

How does one choose between "fewer parameters and less quantization" vs "more parameters and more quantization"?

There were some benchmarks a few years ago from, IIRC, the people behind either llama.cpp or Ollama (I forget which).

The basic rule of thumb is that more parameters are better, with diminishing returns as you get down to 2-3 bits per parameter. This is purely about model quality, not inference speed.

It's just a matter of finding the sweet spot between answer accuracy, available VRAM, and tokens per second.

  • Ok, say I have 14GB of VRAM. What is the tradeoff between a 9B model with 8-bit params vs a 27B model with 3-bit params?
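
For the weight memory side of that question, a back-of-envelope calculation is straightforward: weights take roughly (parameter count × bits per parameter) / 8 bytes. A rough sketch (ignoring KV cache and activation overhead, which add a few GB on top):

```python
# Back-of-envelope VRAM estimate for quantized model weights only.
# Real usage is higher: KV cache, activations, and framework overhead
# all add on top of this figure.

def weight_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate weight memory in GB: params * bits / 8 bytes."""
    return params_billions * bits_per_param / 8

for params, bits in [(9, 8), (27, 3)]:
    print(f"{params}B @ {bits}-bit ≈ {weight_gb(params, bits):.1f} GB")
# 9B  @ 8-bit ≈ 9.0 GB
# 27B @ 3-bit ≈ 10.1 GB
```

So both options fit the weights in 14GB, with the 27B model leaving a bit less headroom for context; per the rule of thumb above, the 27B at 3 bits would usually win on quality, though 3 bits is close to the range where quantization damage starts to bite.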