Comment by lambda

3 months ago

You don't even need to go this expensive. An AMD Ryzen Strix Halo (AI Max+ 395) machine with 128 GiB of unified RAM will set you back about $2500 these days. I can get about 20 tokens/s on Qwen3 Coder Next at an 8 bit quant, or 17 tokens per second on Minimax M2.5 at a 3 bit quant.

Now, these models are a bit weaker, but they're in the realm of Claude Sonnet to Claude Opus 4. 6-12 months behind SOTA on something that's well within a personal hobby budget.

I was testing the 4-bit Qwen3 Coder Next on my 395+ board last night. IIRC it was maintaining around 30 tokens a second even with a large context window.

I haven't tried Minimax M2.5 yet. How do its capabilities compare to Qwen3 Coder Next in your testing?

I'm working on getting a good agentic coding workflow going with OpenCode and I had some issues with the Qwen model getting stuck in a tool calling loop.

  • I've literally just gotten Minimax M2.5 set up, the only test I've done is the "car wash" test that has been popular recently: https://mastodon.world/@knowmadd/116072773118828295

    Minimax passed this test, which even some SOTA models don't pass. But I haven't tried any agentic coding yet.

    I wasn't able to allocate the full context length for Minimax with my current setup, I'm going to try quantizing the KV cache to see if I can fit the full context length into the RAM I've allocated to the GPU. Even at a 3 bit quant MiniMax is pretty heavy. Need to find a big enough context window, otherwise it'll be less useful for agentic coding. With Qwen3 Coder Next, I can use the full context window.

    Yeah, I've also seen the occasional tool call looping in Qwen3 Coder Next, that seems to be an easy failure mode for that model to hit.

    • OK, with MiniMax M2.5 UD-Q3_K_XL (101 GiB), I can't really seem to fit the full context in even at smaller quants. Going up much above 64k tokens, I start to get OOM errors when running Firefox and Zed alongside the model, or just failure to allocate the buffers, even going down to 4 bit KV cache quants (oddly, 8 bit worked better than 4 or 5 bit, but I still ran into OOM errors).

      I might be able to squeeze a bit more out if I were running fully headless with my development on another machine, but I'm running everything on a single laptop.

      So looks like for my setup, 64k context with an 8 bit quant is about as good as I can do, and I need to drop down to a smaller model like Qwen3 Coder Next or GPT-OSS 120B if I want to be able to use longer contexts.

      2 replies →

It is crazy to me that it is that slow, 4 bit quants don't lose much with Qwen3 coder next and unsloth/Qwen3-Coder-Next-UD-Q4_K_XL gets 32 tps with a 3090 (24gb) as a VM with 256k context size with llama.cpp

Same with unsloth/gpt-oss-120b-GGUF:F16 gets 25 tps and gpt-oss20b gets 195 tps!!!

The advantage is that you can use the APU for booting, and pass through the GPU to a VM, and have nice safer VMs for agents at the same time while using DDR4 IMHO.

  • Yeah, this is an AMD laptop integrated GPU, not a discrete NVIDIA GPU on a desktop. Also, I haven't really done much to try tweaking performance, this is just the first setup I've gotten that works.

    • The memory bandwidth of the Laptop CPU is better for fine tuning, but MoE really works well for inference.

      I won’t use a public model for my secret sauce, no reason to help the foundation models on my secret sauce.

      Even an old 1080ti works well for FIM for IDEs.

      IMHO the above setup works well for boilerplate and even the sota models fail for the domain specific portions.

      While I lucked out and foresaw the huge price increases, you can still find some good deals. Old gaming computers work pretty well, especially if you have Claude code locally churn on the boring parts while you work on the hard parts.

      1 reply →

If you don't mind saying, what distro and/or Docker container are you using to bet Qwen3 Coder Next going?

  • I'm running Fedora Silverblue as my host OS, this is the kernel:

      $ uname -a
      Linux fedora 6.18.9-200.fc43.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Feb  6 21:43:09 UTC 2026 x86_64 GNU/Linux
    

    You also need to set a few kernel command line paramters to set it up to allow it to use most of your memory as graphics memory, I have the following in my kernel command line, those are each 110 GiB expressed in number of pages (I figure leaving 18 GiB or so for CPU memory is probably a good idea):

      ttm.pages_limit=28835840 ttm.page_pool_size=28835840
    

    Then I'm running llama.cpp in the official llama.cpp Docker containers. The Vulkan one works out of the box. I had to build the container myself for ROCm, the llama.cpp container has ROCm 7.0 but I need 7.2 to be compatible with my kernel. I haven't actually compared the speed directly between Vulkan and ROCm yet, I'm pretty much at the point where I've just gotten everything working.

    In a checkout of the llama.cpp repo:

      podman build -t llama.cpp-rocm7.2 -f .devops/rocm.Dockerfile --build-arg ROCM_VERSION=7.2 --build-arg ROCM_DOCKER_ARCH='gfx1151' .
    

    Then I run the container with something like:

      podman run -p 8080:8080 --device /dev/kfd --device /dev/dri --security-opt seccomp=unconfined --security-opt label=disable --rm -it -v ~/.cache/llama.cpp/:/root/.cache/llama.cpp/ -v ./unsloth:/app/unsloth llama.cpp-rocm7.2  --model unsloth/MiniMax-M2.5-GGUF/UD-Q3_K_XL/MiniMax-M2.5-UD-Q3_K_XL-00001-of-00004.gguf --jinja --ctx-size 16384 --seed 3407 --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --port 8080 --host 0.0.0.0 -dio
    

    Still getting my setup dialed in, but this is working for now.

    Edit: Oh, yeah, you had asked about Qwen3 Coder Next. That command was:

      podman run -p 8080:8080 --device /dev/kfd --device /dev/dri --security-opt seccomp=unconfined --security-opt label=disable \
        --rm -it -v ~/.cache/llama.cpp/:/root/.cache/llama.cpp/ -v ./unsloth:/app/unsloth llama.cpp-rocm7.2  -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q6_K_XL \
        --jinja --ctx-size 262144 --seed 3407 --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --port 8080 --host 0.0.0.0 -dio
    

    (as mentioned, still just getting this set up so I've been moving around between using `-hf` to pull directly from HuggingFace vs. using `uvx hf download` in advance, sorry that these commands are a bit messy, the problem with using `-hf` in llama.cpp is that you'll sometimes get surprise updates where it has to download many gigabytes before starting up)