Comment by bel8
17 hours ago
And unless I'm mistaken, the repo is about running it with 2-bit quantization.
This is probably far from the raw intelligence provided by cloud providers.
Still, this shines more light on local LLMs for agentic workflows.
It runs both q2 and the original (4-bit routed experts), at more or less the same speed. The q2 quants are not what you might expect: they work extremely well, for a few reasons. For the full model you need a Mac with 256GB.
Out of curiosity, do you have any theories of why it works so well at such aggressive quantization levels?
It's a mix of extreme sparsity, with the routed expert still doing a non-trivial amount of work (and it is q8), and the projections and routing not being quantized either. The fact that it's a QAT model must also play a role, I guess, and I quantized the routed experts' out layers with Q2 instead of IQ2_XXS to retain quality.
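To put rough numbers on the mixed-precision idea, here is a minimal Python sketch that averages nominal bits per weight across tensor groups. The 95/5 split and the group names are made up for illustration, not the real model's parameter counts, and the nominal bit widths ignore the per-block scale overhead that real quant formats carry.

```python
def average_bits_per_weight(groups):
    """Weighted average of bits per weight.

    groups: list of (name, parameter_count, bits_per_weight) tuples.
    """
    total_params = sum(count for _, count, _ in groups)
    total_bits = sum(count * bits for _, count, bits in groups)
    return total_bits / total_params


# Hypothetical split (an assumption, NOT the real model's numbers):
# most weights sit in routed experts at a nominal 2 bits, while a small
# share is kept at higher precision (8 bits here).
hypothetical_groups = [
    ("routed experts, 2-bit", 95, 2),
    ("higher-precision parts, 8-bit", 5, 8),
]

avg = average_bits_per_weight(hypothetical_groups)
print(f"{avg:.2f} bits/weight on average")
```

The point of the sketch is only that keeping a small, heavily used part of the model at higher precision costs little in total size when most parameters sit in sparsely activated experts.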