Comment by tasuki

2 months ago

How does one choose between "fewer parameters and less quantization" vs "more parameters and more quantization" ?

There were some benchmarks a few years ago from, IIRC, the people behind llama.cpp or Ollama (I forget which).

The basic rule of thumb was that, for a fixed memory budget, more parameters at lower precision wins, with diminishing returns as you get down to around 2-3 bits per parameter. This is purely about model quality, not inference speed.
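A rough back-of-the-envelope way to compare the two options is to estimate the weight memory of each candidate and see what fits your VRAM. This sketch only counts the weights themselves; KV cache and runtime overhead (which can add several GiB) are deliberately ignored, and the specific model sizes are just illustrative assumptions:

```python
def weight_gib(params_billions: float, bits_per_param: float) -> float:
    """Approximate weight memory in GiB: params * bits / 8 bytes, in GiB."""
    return params_billions * 1e9 * bits_per_param / 8 / 2**30

# Hypothetical comparison for a ~24 GiB GPU:
big_low_precision = weight_gib(70, 3)    # ~24.4 GiB: barely fits, weights only
small_high_precision = weight_gib(13, 8) # ~12.1 GiB: plenty of headroom

print(f"70B @ 3-bit:  {big_low_precision:.1f} GiB")
print(f"13B @ 8-bit:  {small_high_precision:.1f} GiB")
```

Per the rule of thumb above, the 70B model at 3 bits would usually win on quality despite the aggressive quantization, as long as it actually fits alongside the KV cache.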

It comes down to finding the sweet spot between answer accuracy, available VRAM, and tokens per second.