Comment by simonw
2 hours ago
It doesn't have to be a 2-bit quant - see the update at the bottom of my post:
> Update: Dan's latest version upgrades to 4-bit quantization of the experts (209GB on disk, 4.36 tokens/second) after finding that the 2-bit version broke tool calling while 4-bit handles that well.
That was also just the first version of this pattern that I encountered. It's since seen a bunch of additional activity from other developers in other projects.
I linked to some of those in this follow-up: https://simonwillison.net/2026/Mar/24/streaming-experts/