Comment by simonw

3 hours ago

It doesn't have to be a 2-bit quant - see the update at the bottom of my post:

> Update: Dan's latest version upgrades to 4-bit quantization of the experts (209GB on disk, 4.36 tokens/second) after finding that the 2-bit version broke tool calling while 4-bit handles that well.

That was also just the first version of this pattern that I encountered; it has since seen a bunch of additional activity from other developers in other projects.

I linked to some of those in this follow-up: https://simonwillison.net/2026/Mar/24/streaming-experts/