
Comment by wills_forward

1 day ago

> So this could universally decrease the memory requirements of un-quantized LLMs by 30%? Seems big if true.

Not as big, when Q8 quantization is already considered overkill and cuts memory to 50% of the unquantized size (with a flat 2x speed boost and no additional compute overhead, mind you), and the more common Q4_K_M gets it closer to 30%. Definitely interesting if it can be layered on top of existing quantization, but the K quants already use different precision levels for different layers depending on their perplexity impact, which is similar to the entropy metric they use here, e.g. Q4_K_M keeping some tensors at 6-bit while most are 4-bit. And that's not even considering calibrated imatrix quants, which do something conceptually similar to an FFT to compress even further.
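
For the size figures above, here's a rough back-of-envelope sketch (my own illustrative numbers, not from the linked work): the bits-per-weight values for Q8_0 and Q4_K_M are approximate averages that fold in block scales, and the 7B parameter count is just an assumed example. It lands near the 50% and 30% figures quoted.

```python
# Rough memory arithmetic for common GGUF quant formats (illustrative only;
# real files also carry embeddings, metadata, and per-tensor exceptions).
PARAMS = 7e9  # hypothetical 7B-parameter model

formats = {
    "FP16 (unquantized)": 16.0,  # bits per weight
    "Q8_0":               8.5,   # ~8 bits plus per-block scales
    "Q4_K_M":             4.8,   # mixed 4-/6-bit blocks, rough average
}

fp16_bytes = PARAMS * formats["FP16 (unquantized)"] / 8

for name, bpw in formats.items():
    size_bytes = PARAMS * bpw / 8
    print(f"{name:<20} ~{size_bytes / 1e9:5.1f} GB "
          f"({100 * size_bytes / fp16_bytes:3.0f}% of FP16)")
```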