Comment by someone13
11 hours ago
Out of curiosity, do you have any theories of why it works so well at such aggressive quantization levels?
11 hours ago
It's a mix of factors: extreme sparsity, but with each routed expert still doing a non-trivial amount of work (and it is Q8); the projections and the router weights not being quantized; and the fact that it's a QAT model probably plays a role too. I also quantized the routed experts' output layers with Q2 instead of IQ2_XXS to retain quality.
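Purely as an illustration, here is a minimal Python sketch of what that kind of per-tensor recipe looks like. The tensor-name patterns (`ffn_gate_inp`, `ffn_down_exps`, etc.) follow llama.cpp/GGUF naming conventions, but the rule table, quant-type labels, and default are my assumptions about the scheme described above, not the actual quantization code used.

```python
# Hypothetical sketch of a mixed per-tensor quantization recipe:
# router and projections kept at high precision, routed experts'
# output layers bumped from IQ2_XXS to Q2_K, everything else Q8.
# All patterns and type labels here are illustrative assumptions.
import re

# Ordered (pattern, quant type) rules: the first match wins.
QUANT_RULES = [
    (r"ffn_gate_inp",        "F16"),      # expert router: left unquantized
    (r"attn_(q|k|v|output)", "F16"),      # projections: left unquantized
    (r"ffn_down_exps",       "Q2_K"),     # routed experts' output layers: Q2 instead of IQ2_XXS
    (r"ffn_(gate|up)_exps",  "IQ2_XXS"),  # remaining routed-expert weights: aggressive 2-bit
]
DEFAULT_TYPE = "Q8_0"  # everything else stays at 8-bit

def quant_type_for(tensor_name: str) -> str:
    """Pick a quantization type for a tensor by name; first matching rule wins."""
    for pattern, qtype in QUANT_RULES:
        if re.search(pattern, tensor_name):
            return qtype
    return DEFAULT_TYPE

if __name__ == "__main__":
    for name in ["blk.0.attn_q.weight",
                 "blk.0.ffn_gate_inp.weight",
                 "blk.0.ffn_down_exps.weight",
                 "blk.0.ffn_up_exps.weight"]:
        print(f"{name:32s} -> {quant_type_for(name)}")
```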