← Back to context

Comment by theanonymousone

9 hours ago

Thanks a lot. How about Q8 vs FP16/BF16? Have you checked them too?

Q8 quant is very minimal fall off in terms of KLD against the lab 16 bit. If you have the memory for BF16 KV-cache (which is usually easier to stomach) then the Q8 is very close. But even Q8 quant model with Q8 KV-cache is very close.

Smaller quants for the model start to fall off but more importantly, smaller KV-cache quants fall off much faster so avoid less than Q8 there.