Show HN: Running Gemma-4 26B at 124 tokens/sec on a CPU, no GPU
5 hours ago (apeg.dev)
I wanted to know how fast a 26B mixture-of-experts model could run on a desktop CPU with no GPU. I got ~40 tok/s single-stream (lossless) and ~124 tok/s batched. The surprising part was the byte budget: for this model the output head accounts for 32% of per-token bytes and the experts only 16%, so the head is the thing to compress, not the experts. The writeup has the bandwidth roofline and the dead ends; the repo has the reproducible recipe. Happy to answer questions.
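To make the byte-budget point concrete, here is a rough sketch rather than the code from the repo. It assumes decoding is purely memory-bandwidth-bound, so time per token scales with bytes read per token. The 32% and 16% shares are the ones quoted above; the function name and the 2x/4x compression ratios are hypothetical, picked only for illustration.

```python
def speedup_from_compression(share: float, ratio: float) -> float:
    """Amdahl-style speedup when one component's bytes shrink by `ratio`.

    share: fraction of per-token bytes spent on the component (0..1).
    ratio: hypothetical compression factor (2.0 means half the bytes).
    Assumes time per token is proportional to bytes read per token.
    """
    if not 0.0 <= share <= 1.0:
        raise ValueError("share must be between 0 and 1")
    if ratio <= 0.0:
        raise ValueError("ratio must be positive")
    # Bytes left after compression, as a fraction of the original budget.
    remaining = (1.0 - share) + share / ratio
    return 1.0 / remaining


# Shares quoted in the post; everything else is an assumption.
HEAD_SHARE = 0.32
EXPERT_SHARE = 0.16

if __name__ == "__main__":
    for ratio in (2.0, 4.0):
        head = speedup_from_compression(HEAD_SHARE, ratio)
        experts = speedup_from_compression(EXPERT_SHARE, ratio)
        print(f"{ratio:.0f}x compression: head {head:.2f}x, experts {experts:.2f}x")
```

Under those assumptions, halving the head's bytes gives about a 1.19x speedup, while halving the experts' bytes gives about 1.09x. That is why the head is the more attractive target here. Real gains depend on decompression cost and cache behavior, which this ignores.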
The output head byte budget is surprising. Did you try any tradeoff where the head is compressed more aggressively but experts stay mostly untouched?