Comment by simonw
9 hours ago
I got this running locally using llama.cpp from Homebrew and the Unsloth quantized model like this:
brew upgrade llama.cpp # or brew install if you don't have it yet
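(If you want to confirm the upgrade gave you a recent build, llama-cli prints the build number with:
llama-cli --version
)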
Then:
llama-cli \
-hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
--fit on \
--seed 3407 \
--temp 1.0 \
--top-p 0.95 \
--min-p 0.01 \
--top-k 40 \
--jinja
That opened a CLI chat interface. For a web UI on port 8080, along with an OpenAI-compatible chat completions endpoint, do this:
llama-server \
-hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
--fit on \
--seed 3407 \
--temp 1.0 \
--top-p 0.95 \
--min-p 0.01 \
--top-k 40 \
--jinja
It's using about 28GB of RAM.
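Once the server is up you can sanity-check the OpenAI-compatible endpoint with curl. This is just a minimal sketch: it assumes the default port 8080, and the "model" field is a placeholder since llama-server serves whichever model it loaded:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "qwen3-coder-next", "messages": [{"role": "user", "content": "Write a hello world script in Python"}]}'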
What's the tokens per second speed?
What are your impressions?
I got Codex CLI running against it and was sadly very unimpressed - it got stuck in a loop running "ls" for some reason when I asked it to create a new file.
Yes, sadly that sometimes happens. The issue is that Codex CLI / Claude Code were designed specifically for GPT / Claude models, so it's hard for OSS models to make full use of their spec / tools directly, and they can get stuck in loops sometimes. I would try the MXFP4_MOE quant to see if it helps, and maybe try Qwen CLI (I was planning to make a guide for that as well).
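Trying the other quant should just be a matter of changing the tag in the -hf argument. A sketch, assuming the repo publishes that file under an MXFP4_MOE tag (keep the same sampling flags as above):
llama-server \
-hf unsloth/Qwen3-Coder-Next-GGUF:MXFP4_MOE \
--fit on \
--jinja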
I guess local models will really take off once OSS models can truly utilize Codex / CC well.