Comment by EnPissant

4 hours ago

I don't mean to be a jerk, but a 2-bit quant, experts reduced from 10 to 4, no way of knowing whether the test ran long enough for the SSD to thermal throttle, and still only 5.5 tokens/s does not sound useful to me.


simonw 3 hours ago

It's a lot more useful than being entirely unable to try out the model.

EnPissant 2 hours ago
But you aren't trying out the model. You quantized beyond what people generally say is acceptable, and reduced the number of experts, which these models are not designed for.
Even worse, the GitHub repo advertises:
> Pure C/Metal inference engine that runs Qwen3.5-397B-A17B (a 397 billion parameter Mixture-of-Experts model) on a MacBook Pro with 48GB RAM at 4.4+ tokens/second with production-quality output including tool calling.
This hides the fact that the active parameter count is _not_ 17B once the number of experts is reduced.
- simonw 25 minutes ago
  
  It doesn't have to be a 2-bit quant - see the update at the bottom of my post:
  > Update: Dan's latest version upgrades to 4-bit quantization of the experts (209GB on disk, 4.36 tokens/second) after finding that the 2-bit version broke tool calling while 4-bit handles that well.
  That was also just the first version of this pattern that I encountered; it has since seen a bunch of additional activity from other developers in other projects.
  I linked to some of those in this follow-up: https://simonwillison.net/2026/Mar/24/streaming-experts/