Comment by _qua

6 days ago

For roughly equivalent memory sizes, how does one choose between the bit depth and the model size?

As a rule of thumb, the larger the model is, the more you can quantize it without losing performance, but smaller models will run faster. It usually makes sense to pick the larger model at a lower-bit quant, as long as the speed is acceptable. Smaller models also use a smaller KV cache, so longer contexts are more viable. It really depends on what your use case is.

Imo though, going below 4 bits for anything smaller than 70B is not worth the degradation. BF16/FP16 and Q8 are usually indistinguishable except for vision encoders (mmproj) and for really small models, like under 2B.
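
To put rough numbers on the "roughly equivalent memory sizes" comparison, here's a back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration: the two model shapes (layer counts, KV heads, head dim) are made up, weights are estimated as parameters times bits divided by 8, and the KV cache is assumed to be stored in 16-bit. Real quant formats also store per-block scales, so the effective bits per weight end up a bit higher than the nominal number.

```python
GIB = 1024 ** 3


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight memory: parameters * bits per weight / 8."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size for one sequence.

    The leading 2 accounts for one K and one V tensor per layer.
    bytes_per_elem=2 assumes a 16-bit cache (an assumption, not a given).
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


# Hypothetical models, not real ones: a 14B at 8 bits vs. a 28B at 4 bits.
small_weights = weight_bytes(14e9, 8)
large_weights = weight_bytes(28e9, 4)

# Hypothetical shapes: the larger model has twice the layers.
small_kv = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_len=32768)
large_kv = kv_cache_bytes(n_layers=64, n_kv_heads=8, head_dim=128, context_len=32768)

print(f"small: weights {small_weights / GIB:.1f} GiB, KV cache {small_kv / GIB:.1f} GiB")
print(f"large: weights {large_weights / GIB:.1f} GiB, KV cache {large_kv / GIB:.1f} GiB")
```

With these made-up shapes the weights come out the same size, but the larger model's KV cache is twice as big at the same context length, which is the "smaller models make longer contexts more viable" point.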