Comment by dust42
5 days ago
I am actually doing a good part of my dev now with Qwen3-Coder-Next on an M1 64GB with Qwen Code CLI (a fork of Gemini CLI). I very much like
a) having an idea of how many tokens I use,
b) being independent of VC-financed token machines, and
c) being able to use it on a plane/train.
Also I never have to wait in a queue, nor am I told to come back in a few hours. And many answers come back within a second.
I don't do full vibe coding with a dozen agents though. I read all the code it produces and guide it where necessary.
Last but not least, at some point the VC-funded party will be over, and when that happens, one had better know how to be highly efficient with AI token use.
How many tokens per second are you getting?
What's the advantage of Qwen Code CLI over OpenCode?
320 tok/s prompt processing (PP) and 42 tok/s token generation (TG) with the 4-bit quant and MLX. Llama.cpp was about half that for this model, but afaik it improved a few days ago; I haven't tested it yet.
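For a feel of what those throughputs mean in practice, here is a rough back-of-the-envelope latency estimate; the prompt and reply sizes are made-up examples, not measurements from the thread:

```python
# Rough per-turn latency from the quoted throughputs:
# 320 tok/s prompt processing (prefill), 42 tok/s generation (decode).
def turn_latency(prompt_tokens, output_tokens, pp=320.0, tg=42.0):
    """Seconds for one request: prefill time plus decode time."""
    return prompt_tokens / pp + output_tokens / tg

# Hypothetical example: an 8000-token context producing a 500-token reply:
# 8000/320 = 25 s prefill + 500/42 ≈ 11.9 s decode ≈ 37 s total.
print(round(turn_latency(8000, 500), 1))
```

This also shows why prompt-processing speed matters so much for agentic coding tools, which routinely resend large contexts.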
I have tried many tools locally and was never really happy with any of them. I finally tried Qwen Code CLI, assuming it would run well with a Qwen model, and it does. YMMV; I mostly do JavaScript and Python. The most important setting was the max context size: it then auto-compacts before reaching it. I run with 65536 but may raise this a bit.
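The auto-compaction behavior described above can be sketched roughly as follows. This is a hypothetical illustration, not Qwen Code CLI's actual implementation; the 90% trigger threshold, the `count_tokens`/`summarize` hooks, and keeping the last four turns verbatim are all assumptions:

```python
# Hypothetical sketch of context auto-compaction: once the running token
# count nears the configured ceiling, older turns are folded into a
# summary so the session stays under the limit.
MAX_CONTEXT = 65536                   # the ceiling the commenter configured
COMPACT_AT = int(MAX_CONTEXT * 0.9)   # assumed trigger threshold

def maybe_compact(history, count_tokens, summarize):
    """history: list of message strings; returns a possibly compacted history."""
    total = sum(count_tokens(m) for m in history)
    if total < COMPACT_AT:
        return history
    # Keep the most recent turns verbatim, summarize the rest (assumption).
    head, tail = history[:-4], history[-4:]
    return [summarize(head)] + tail
```

The practical point is that compaction trades old-turn fidelity for headroom, which is why picking a max context that fits your hardware matters.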
Last but not least, OpenCode is VC-funded, so at some point they will have to make money, while Gemini CLI / Qwen Code CLI are not their companies' primary products but are definitely dog-fooded.
Works for me, but sometimes there's an issue with Qwen's tool-call template: past chat messages get changed, so the KV cache is invalidated and it has to reprocess the input tokens from scratch. Doesn't happen all the time though.
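The reason an edited history is so costly: inference servers typically reuse cached keys/values only for the longest common token prefix between the cached request and the new one. A minimal sketch of that matching logic (a simplified illustration, not any particular server's code):

```python
# Why rewritten past messages force a full reprocess: only the shared
# token prefix between the cached prompt and the new prompt can reuse
# existing KV-cache entries; everything after the first difference must
# be recomputed.
def reusable_prefix(cached_tokens, new_tokens):
    """Length of the common prefix whose KV entries can be reused."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

# Unchanged history: everything but the new turn is reused.
print(reusable_prefix([1, 2, 3, 4], [1, 2, 3, 4, 5]))  # 4
# Template rewrote an early message: cache is useless from token 1 on.
print(reusable_prefix([1, 2, 3, 4], [1, 9, 3, 4, 5]))  # 1
```

So even a one-token change near the start of the chat means paying the full prompt-processing cost again.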
Btw, I also get 42-60 tok/s on an M4 Max with the MLX 4-bit quants hosted by LM Studio. What software do you use to run it?