Comment by dirk94018

13 hours ago

We wrote toasted, the linuxtoaster inference engine, and are getting 400 tok/s prefill and 100 tok/s generation on an M4 Max with 128 GB RAM running Qwen3-next-coder at 6-bit; 8-bit runs too. KV caching means it feels snappy in chat mode. Local can work. For pro work such as programming, I'd still prefer SOTA models, or GLM 4.7 via Cerebras.
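
To put rough numbers on why KV caching matters for chat, here is a back-of-envelope sketch. It assumes the two figures above are tokens per second, that latency scales linearly with token count, and that a cached conversation prefix costs nothing to reuse. The function name and the example turn sizes are made up for illustration and are not taken from toasted.

```python
# Rough per-turn latency model. Assumption: the quoted rates are tokens/second.
PREFILL_TPS = 400.0  # prompt processing rate quoted above
GEN_TPS = 100.0      # generation rate quoted above


def turn_latency(new_tokens: int, history_tokens: int,
                 reply_tokens: int, use_kv_cache: bool) -> float:
    """Seconds for one chat turn under a simple linear model.

    With a KV cache, only the new tokens are prefilled; without one,
    the whole conversation history is prefilled again on every turn.
    """
    prefill_tokens = new_tokens if use_kv_cache else new_tokens + history_tokens
    return prefill_tokens / PREFILL_TPS + reply_tokens / GEN_TPS


# Hypothetical turn: 8000 tokens of history, 200 new tokens, 300-token reply.
print(turn_latency(200, 8000, 300, use_kv_cache=True))   # 3.5
print(turn_latency(200, 8000, 300, use_kv_cache=False))  # 23.5
```

Under these assumptions nearly all of the uncached time is spent re-reading the history, and that is the part the cache removes.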
