Comment by simonw
15 hours ago
There have been some very interesting experiments with streaming from SSD recently: https://simonwillison.net/2026/Mar/18/llm-in-a-flash/
I don't mean to be a jerk, but a 2-bit quant, experts reduced from 10 to 4, no indication of whether the test runs long enough for the SSD to thermal throttle, and still only 5.5 tokens/s does not sound useful to me.
It's a lot more useful than being entirely unable to try out the model.
But you aren't trying out the model. You quantized beyond what people generally say is acceptable, and reduced the number of experts, which these models are not designed for.
Even worse, the GitHub repo advertises:
> Pure C/Metal inference engine that runs Qwen3.5-397B-A17B (a 397 billion parameter Mixture-of-Experts model) on a MacBook Pro with 48GB RAM at 4.4+ tokens/second with production-quality output including tool calling.
Which implies it is running the full _17B_ active parameters per token (all _10_ experts), but they reduced it to _4_ experts.