Comment by simonw
9 hours ago
I got this running locally using llama.cpp from Homebrew and the Unsloth quantized model like this:
brew upgrade llama.cpp # or brew install if you don't have it yet
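(If you want to confirm the upgrade gave you a recent build, llama-cli prints the build number with:
llama-cli --version
)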
Then:
llama-cli \
-hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
--fit on \
--seed 3407 \
--temp 1.0 \
--top-p 0.95 \
--min-p 0.01 \
--top-k 40 \
--jinja
That opened a CLI chat interface. For a web UI on port 8080, along with an OpenAI-compatible chat completions endpoint, do this:
llama-server \
-hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
--fit on \
--seed 3407 \
--temp 1.0 \
--top-p 0.95 \
--min-p 0.01 \
--top-k 40 \
--jinja
It's using about 28GB of RAM.
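Once the server is up you can sanity-check the OpenAI-compatible endpoint with curl. This is just a minimal sketch: it assumes the default port 8080, and the "model" field is a placeholder since llama-server serves whichever model it loaded:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "qwen3-coder-next", "messages": [{"role": "user", "content": "Write a hello world script in Python"}]}'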
What's the tokens per second speed?
What are your impressions?
I got Codex CLI running against it and was sadly very unimpressed - it got stuck in a loop running "ls" for some reason when I asked it to create a new file.
Yes, sadly that sometimes happens. The issue is that Codex CLI / Claude Code were designed specifically for GPT / Claude models, so it's hard for OSS models to make full use of their spec / tools directly, and they can get stuck in loops sometimes. I would try the MXFP4_MOE quant to see if it helps, and maybe try Qwen CLI (I was planning to make a guide for that as well).
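Trying the other quant should just be a matter of changing the tag in the -hf argument. A sketch, assuming the repo publishes that file under an MXFP4_MOE tag (keep the same sampling flags as above):
llama-server \
-hf unsloth/Qwen3-Coder-Next-GGUF:MXFP4_MOE \
--fit on \
--jinja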
I guess local models will really take off once OSS models can truly utilize Codex / CC well.