Comment by gcr
1 hour ago
here's a simple setup to get you started on an Apple M1 Max from 2021 with 32GB VRAM. it will download 20GB of models to `~/.cache/huggingface/hub`, which you can delete when you're done.
/Users/gcr/llama.cpp/build/bin/llama-server
-hf unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_M
--no-mmproj-offload
--fit on
-c 65536 # edit to taste
--reasoning on --chat-template-kwargs '{"preserve_thinking": true}'
--sleep-idle-seconds 90 # very aggressive: purge model from vram after this long
-ctk q8_0 -ctv q8_0 # Optional. Lower memory use, but lower speed. Omit if you can.
I don't recommend ollama or lm-studio. Ollama's in the process of switching from their llama-cpp backend anyway, but their new go framework frequently OOMs and crashes on my hardware. I also don't recommend MLX-based inference backends on this hardware; I've found them to consistently reduce performance, contrary to what I've read online. I've tried all the llama-cpp metal forks, but right now, MTP, TurboQuant, MLX, etc etc etc are too new and just slow things down. It's all dust in the wind still.
For agent harnesses, opencode is okay, as is pi or even Zed's built in agent panel. Claude code "works" with ANTHROPIC_BASE_URL=http://localhost:8080/v1, but is very chatty (the default system prompt burns 20k tokens). Crush (from the charm-bracelet folks) is particularly nice when starting out. I've personally converged on pi-agent under an otherwise-mostly-default setup. You can ask qwen to customize pi or write you an extension which helps a little.
You'll need to add `http://localhost:8080/v1` as an OpenAI-compatible model provider in your coding harness with any API key (doesn't matter) and any model identifier (doesn't matter with llama-cpp).
Note that pi doesn't have permissions. Everything is permitted. The hundred hungry ghosts you've trapped in a jar WILL find a way to delete your home folder someday. That's what Man gets for summoning demons without casting a circle of protection first. Flying too close to the sun etc etc etc
Take backups and then go have fun. Hope this helps.
No comments yet
Contribute on Hacker News ↗