Comment by briga
2 hours ago
I was getting dangerously close to my weekly Claude Code limit last night so I had Claude set up Qwen3.6 with llama.cpp and OpenCode. Honestly it's a great (free!) alternative to Claude Code--certainly more than good enough for a lot of smaller less complex tasks. I'm excited to try this new version. The fact that open-source models are so close to the frontier is very impressive.
Out of interest, what machine and model are you running it on?
I tried the qwen3.6-27b Q6_k GUFF in llama.cpp and LM Studio on my M2 MacBook Pro 32GB machine last week, and I barely get a token a second with either.
What sort of speed should I be expecting?
I tried some of the Llama 3 34b (nous-capybara?) models two years ago with llama.cpp, and I seem to remember getting a few tokens a second then, so not sure if I've got something completely mis-configured, or I just have unreasonable expectations.
Or maybe qwen 3.x is slower for some reason? (Is it mixture of experts?)
I'm not expecting it to be instant, but what I'm currently seeing is not really usable.
Which exact model are you using? And with which parameters and quant? And on what hardware? Are you using any specific MCPs or other tools to optimize performance like context-mode or dynamic context pruning? I’ve used local models a reasonable amount before but I’m just starting out with opencode. Haven’t had great results yet but really want this to work for simpler tasks. My opencode newly installed is also having iterm on 100% cpu in idle. :/
I'm running Qwen3.6:27b Q4 KM on a 4090 and similarly fast CPU and I think 32GB of RAM. Make sure the context window is set to be big enough otherwise the conversation will keep compacting. No special MCP tools set up yet. Qwen is able to do web search out-of-the-box although I think it is getting blocked by anti-bot firewalls--I still need to figure out if I can fix that.
here's a simple setup to get you started on an Apple M1 Max from 2021 with 32GB VRAM. it will download 20GB of models to `~/.cache/huggingface/hub`, which you can delete when you're done.
I don't recommend ollama or lm-studio. Ollama's in the process of switching from their llama-cpp backend anyway, but their new go framework frequently OOMs and crashes on my hardware. I also don't recommend MLX-based inference backends on this hardware; I've found them to consistently reduce performance, contrary to what I've read online. I've tried all the llama-cpp metal forks, but right now, MTP, TurboQuant, MLX, etc etc etc are too new and just slow things down. It's all dust in the wind still.
For agent harnesses, opencode is okay, as is pi or even Zed's built in agent panel. Claude code "works" with ANTHROPIC_BASE_URL=http://localhost:8080/v1, but is very chatty (the default system prompt burns 20k tokens). Crush (from the charm-bracelet folks) is particularly nice when starting out. I've personally converged on pi-agent under an otherwise-mostly-default setup. You can ask qwen to customize pi or write you an extension which helps a little.
You'll need to add `http://localhost:8080/v1` as an OpenAI-compatible model provider in your coding harness with any API key (doesn't matter) and any model identifier (doesn't matter with llama-cpp).
Note that pi doesn't have permissions. Everything is permitted. The hundred hungry ghosts you've trapped in a jar WILL find a way to delete your home folder someday. That's what Man gets for summoning demons without casting a circle of protection first. Flying too close to the sun etc etc etc
Take backups and then go have fun. Hope this helps.
This one doesnt seem to be open source though sadly. Using chinese servers is a step to far for me personally
Qwen Max are usually closed, unfortunately.
Qwen3.6 with claude code works great. I get a lot better results with that than opencode and qwen3.6. Claude Code is a great harness, and good harness/tool integration makes a big difference. You just have a settings.json with your ollama setup and the qwen model and you can use it.