Comment by jsnell

15 days ago

Needing less memory for inference is the entire point of quantization. Saving disk space or having a smaller download could not justify any level of quality degradation.

Small point of order:

> entire point...smaller download could not justify...

Q4_K_M has layers and layers of consensus, built over a couple of years of polling, surveying, A/B testing, and benchmarking, showing there's ~0 quality degradation.

  • > Q4_K_M has ~0 quality degradation

    Llama 3.3 already shows a degradation from Q5 to Q4.

    As compression improves over the years, the effects of even Q5 quantization will begin to appear.
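As a toy illustration of why each bit-width step matters, here is a minimal round-to-nearest quantization sketch. This is an assumption-laden stand-in, not the actual llama.cpp k-quant format: real Q4_K_M / Q5_K_M use super-blocks with quantized scales and mins, but the basic trade-off (fewer bits, larger rounding error) shows up even in this simplified per-group scheme:

```python
import numpy as np

def quantize_rtn(w, bits, group_size=32):
    """Toy symmetric round-to-nearest quantization with per-group scales.

    A simplified stand-in for 4/5-bit weight quantization; real k-quants
    are considerably more elaborate.
    """
    w = w.reshape(-1, group_size)
    qmax = 2 ** (bits - 1) - 1            # e.g. 7 for 4-bit, 15 for 5-bit
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0               # avoid division by zero
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return (q * scale).reshape(-1)        # dequantized weights

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)

for bits in (4, 5):
    err = np.sqrt(np.mean((w - quantize_rtn(w, bits)) ** 2))
    print(f"{bits}-bit RMS reconstruction error: {err:.4f}")
```

Each additional bit roughly halves the rounding error, so whether a given bit-width is "lossless in practice" depends on how much of that error the model can absorb, which is exactly what shifts as models become more information-dense.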