Comment by spmurrayzzz
11 days ago
I've tested this myself often (as an aside: I'm in said community, I run 2x RTX Pro 6000 locally, 4x 3090 before that), and I think what you said re: "willing to wait" is probably the difference maker for me.
I can run Minimax 2.1 at 5 bpw with 200k context fully offloaded to GPU. The 30-40 tok/s feels like a lifetime for long-horizon tasks, especially with subagent delegation etc., but it's still fast enough to be a daily driver.
But that's more or less my cutoff. Whenever I've tested setups that dip into single-digit or sub-single-digit throughput, it becomes maddening and entirely unusable (for me).
What is bpw?
Bits per weight; it's the average precision across all the weights. When you quantize these models, they don't just use a fixed precision across all model layers/weights. There's a mix, and it varies per quant method. This is why you can get bit precisions that aren't "real" in a strict computing sense.
e.g. a 4-bit quant can have half the attention and feed-forward tensors in Q6_K and the rest in Q4_K. Due to how block scaling works, those k-quant dtypes (specifically for llama.cpp/GGUF) have a larger bpw than their names suggest: Q4_K is around ~4.5 bpw, and Q6_K is ~6.5.
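For a feel of where those fractional numbers come from, here's a minimal sketch of the arithmetic in plain Python. The Q4_K/Q6_K block layouts below follow llama.cpp's k-quant super-blocks of 256 weights; the tensor split at the bottom is a made-up mix for illustration, not any real model's quant recipe.

```python
SUPER_BLOCK = 256  # weights per super-block in llama.cpp k-quants

def q4_k_bpw() -> float:
    quant_bits = SUPER_BLOCK * 4   # 256 weights at 4 bits each
    scale_bits = 8 * (6 + 6)       # 8 sub-blocks of 32: 6-bit scale + 6-bit min
    super_bits = 2 * 16            # two fp16 super-block scales (d, dmin)
    return (quant_bits + scale_bits + super_bits) / SUPER_BLOCK

def q6_k_bpw() -> float:
    quant_bits = SUPER_BLOCK * 6   # 256 weights at 6 bits each
    scale_bits = 16 * 8            # 16 sub-blocks of 16: one 8-bit scale each
    super_bits = 16                # one fp16 super-block scale
    return (quant_bits + scale_bits + super_bits) / SUPER_BLOCK

print(q4_k_bpw())  # 4.5
print(q6_k_bpw())  # 6.5625

# Effective bpw of a mixed quant = bpw of each tensor weighted by its
# parameter count. Tensor names and sizes here are hypothetical.
tensors = [
    ("attn_q",   100_000_000, q6_k_bpw()),
    ("attn_k",   100_000_000, q6_k_bpw()),
    ("ffn_up",   300_000_000, q4_k_bpw()),
    ("ffn_down", 300_000_000, q4_k_bpw()),
]
total_bits = sum(n * bpw for _, n, bpw in tensors)
total_params = sum(n for _, n, _ in tensors)
print(f"effective bpw: {total_bits / total_params:.2f}")  # ~5.02
```

So the per-block scale/min metadata is what pushes Q4_K to 4.5 bpw and Q6_K to ~6.56, and the per-tensor mix is why a "5 bpw" quant doesn't correspond to any single dtype.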