Comment by androiddrew

17 hours ago

Dual AMD Radeon AI Pro 9700s (600 watts total, 64GB of VRAM) run Qwen 3.6 27B at FP8 with MTP on vLLM at 50ish TPS for decode. Cards cost $1300 apiece. Enough KV cache to fully max out two concurrent sessions.
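A back-of-the-envelope check on why that setup leaves room for KV cache. This is only a sketch: it assumes FP8 stores about one byte per parameter, takes 27e9 as the parameter count, and ignores the MTP head, activations, and runtime overhead. The `weight_gb` helper is a made-up name for illustration.

```python
def weight_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (10^9 bytes)."""
    return params * bits_per_weight / 8 / 1e9

total_vram_gb = 64            # dual cards, 64GB total (from the comment above)
weights = weight_gb(27e9, 8)  # assumption: FP8 is ~1 byte per parameter
headroom = total_vram_gb - weights  # left for KV cache, activations, overhead

print(f"weights ~{weights:.0f} GB, headroom ~{headroom:.0f} GB")
```

The real headroom will be smaller once the runtime and activations take their share, so treat the number as an upper bound.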

It was super rough going to get started with them back in January, but right now the cards purrrr and I haven't even tried tuning yet. You need to use a patched vLLM image with aiter but besides that things are finally working on the ROCm front.

Agreed. I have a single 9700 and I'm able to fit Q6 27B at 30tps or Q5 35B at 100tps very easily via llamacpp running vulkan.

The results are impressive considering the number of people trashing AMD and still trying to recommend 3090s. I hope to buy a 2nd one at some point, but I also hate the version hell of vLLM, the R9700, the ROCm version, and Qwen3.6 all not agreeing with each other. I haven't gotten vLLM to run properly for Qwen3.6, since the version that runs on a 9700 doesn't support 3.6 yet.

I'm trying to quickly hack out an optimized path for just Qwen3.6 to run against ROCm natively (e.g. my own inference server for 9700s basically) and see if it can perform better than llamacpp vulkan's results.

Word of caution - the last llamacpp with good performance was b9209 from a month ago. After that, for some reason, vulkan performance dropped by 10x, which has made me lose confidence in llamacpp in the long run.

Having said all that, 3x R9700 is 96GB for about $4k and peak 900 watts. A 96GB Blackwell is $12k and peak 600 watts. And they will have a similar memory throughput (minor negative to the AMD cards for split processing). It's crazy how price efficient the R9700 is compared to the Nvidia cards.
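The comparison above, worked through as a quick sketch. The prices and wattages are the ones quoted in this thread ($1300 per card, $12k for the Blackwell), not verified list prices, and the `Setup` class is just an illustrative helper.

```python
from dataclasses import dataclass

@dataclass
class Setup:
    name: str
    price_usd: float
    vram_gb: float
    peak_watts: float

    @property
    def usd_per_gb(self) -> float:
        """Dollars paid per GB of VRAM."""
        return self.price_usd / self.vram_gb

# Figures as quoted in the thread (assumptions, not verified prices).
r9700_x3 = Setup("3x R9700", 3 * 1300, 96, 900)
blackwell = Setup("96GB Blackwell", 12_000, 96, 600)

for s in (r9700_x3, blackwell):
    print(f"{s.name}: ${s.usd_per_gb:.2f}/GB, {s.peak_watts:.0f} W peak")

ratio = blackwell.price_usd / r9700_x3.price_usd
print(f"price ratio: {ratio:.1f}x")
```

On these numbers the Blackwell costs roughly 3x as much for the same VRAM, while the three AMD cards draw about 1.5x the peak power.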

  • I'm getting around 45 tps on a single R9700 for Q6 27B with build b9811 (using https://github.com/kyuz0/amd-r9700-ai-toolboxes) with the following parameters:

    llama-server -hf unsloth/Qwen3.6-27B-MTP-GGUF:Q6_K -c 135000 -ngl 999 -np 2 -t 16 --temp 0.0 --top-p 0.95 --top-k 20 --min-p 0.00 -b 4096 -ub 4096 --chat-template-kwargs '{"preserve_thinking": true}' -fa 1 --spec-type draft-mtp --spec-draft-n-max 2

    • I'll give 27B-MTP a try. I think I can tolerate 45 tps if the results are technically better. 35B is pretty good, but definitely shows its limitations at times (probably either due to the heavy cache quantization I'm doing, or the heavy model quantization vs what 2 GPUs could run).

      My biggest gripe is that both pi and opencode seem to have trouble parsing the thinking blocks at times, and the model sometimes cuts off mid-thinking or prints out weird character tokens. I don't know if that's because of llamacpp, pi/opencode, or qwen3.6, or some weird combination of them all, as I haven't investigated that problem fully yet.