Comment by ThunderSizzle

8 hours ago

Agreed. I have a single R9700 and can easily fit Q6 27B at 30 tps or Q5 35B at 100 tps via llamacpp running Vulkan.

The results are impressive considering the number of people trashing AMD and still recommending 3090s. I hope to buy a 2nd one at some point, but I also hate the version hell of vLLM, the R9700, the ROCm version, and Qwen3.6 all not agreeing with each other. I haven't gotten vLLM to run properly for Qwen3.6, since the version that runs on an R9700 doesn't support 3.6 yet.

I'm trying to quickly hack out an optimized path for just Qwen3.6 to run against ROCm natively (e.g. my own inference server for R9700s, basically) and see if it can beat llamacpp's Vulkan results.

Word of caution: the last llamacpp build with good performance was b9209 from a month ago. After that, for some reason, Vulkan performance dropped by 10x, which has made me lose confidence in llamacpp in the long run.

Having said all that, 3x R9700 is 96GB for $4k and a 900 watt peak. A 96GB Blackwell is $12k and a 600 watt peak. And they will have similar memory throughput (a minor negative for the AMD cards because of split processing). It's crazy how price efficient the R9700 is compared to the Nvidia cards.
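To put the comparison above in per-GB terms, here is a rough sketch in Python. It uses only the figures quoted in the comment (96GB, $4k, 900 W vs. 96GB, $12k, 600 W); the dictionary keys and the helper name `per_gb` are made up for illustration, and real prices and power draw will vary.

```python
# Figures as quoted in the comment; treat them as rough, not measured.
setups = {
    "3x R9700": {"vram_gb": 96, "price_usd": 4000, "peak_watts": 900},
    "96GB Blackwell": {"vram_gb": 96, "price_usd": 12000, "peak_watts": 600},
}


def per_gb(setup):
    """Return cost and peak power per GB of VRAM for one setup."""
    return {
        "usd_per_gb": setup["price_usd"] / setup["vram_gb"],
        "watts_per_gb": setup["peak_watts"] / setup["vram_gb"],
    }


summary = {name: per_gb(setup) for name, setup in setups.items()}

for name, metrics in summary.items():
    print(f"{name}: ${metrics['usd_per_gb']:.2f}/GB, "
          f"{metrics['watts_per_gb']:.2f} W/GB peak")

# How much more the Blackwell costs, and how much more power the AMD setup draws.
price_ratio = setups["96GB Blackwell"]["price_usd"] / setups["3x R9700"]["price_usd"]
power_ratio = setups["3x R9700"]["peak_watts"] / setups["96GB Blackwell"]["peak_watts"]
print(f"Blackwell costs {price_ratio:.1f}x as much; "
      f"3x R9700 peaks at {power_ratio:.1f}x the power")
```

So on these numbers the AMD setup is about a third of the price per GB while drawing about 1.5x the peak power. This says nothing about actual throughput, which depends on the model and backend.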

I'm getting around 45 tps on a single R9700 for Q6 27B with build b9811 (using https://github.com/kyuz0/amd-r9700-ai-toolboxes) with the following parameters:

llama-server -hf unsloth/Qwen3.6-27B-MTP-GGUF:Q6_K -c 135000 -ngl 999 -np 2 -t 16 --temp 0.0 --top-p 0.95 --top-k 20 --min-p 0.00 -b 4096 -ub 4096 --chat-template-kwargs '{"preserve_thinking": true}' -fa 1 --spec-type draft-mtp --spec-draft-n-max 2

  • I'll give 27B-MTP a try. I think I can tolerate 45 tps if the results are actually better. 35B is pretty good, but it definitely shows its limitations at times (probably due to either the heavy cache quantization I'm doing, or the heavy model quantization compared with what 2 GPUs could run).

    My biggest gripe is that both pi and opencode seem to have trouble parsing the thinking blocks at times, and the model sometimes cuts off mid-thinking or prints weird character tokens. I don't know if that's because of llamacpp, pi/opencode, Qwen3.6, or some weird combination of them all, as I haven't fully investigated the problem yet.
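One way to start narrowing down the thinking-block problem is to capture the raw text coming out of the server and check it before pi or opencode touch it. A minimal sketch in Python, assuming the model wraps its reasoning in `<think>` and `</think>` tags in the raw output (if the server is set up to split reasoning into a separate field, this check would need to run on that field instead). The function name `inspect_thinking` and the returned keys are hypothetical.

```python
# Assumed tag names; adjust if the raw output uses different markers.
THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


def inspect_thinking(text):
    """Flag unbalanced thinking tags and suspicious characters in raw output."""
    opens = text.count(THINK_OPEN)
    closes = text.count(THINK_CLOSE)
    # U+FFFD usually means bytes were decoded badly somewhere in the pipeline;
    # control characters other than newline, carriage return, tab are also odd.
    odd_chars = [
        ch for ch in text
        if ch == "\ufffd" or (ord(ch) < 32 and ch not in "\n\r\t")
    ]
    return {
        "unclosed_think": opens > closes,  # generation likely cut off mid-thinking
        "stray_close": closes > opens,     # opening tag lost or stripped upstream
        "odd_chars": len(odd_chars),
    }


if __name__ == "__main__":
    sample = "<think>checking the build flags\ufffd"
    print(inspect_thinking(sample))
```

If the raw output already shows an unclosed block or odd characters, the client is probably not the culprit; if the raw output is clean, the parsing in pi or opencode is the more likely suspect.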