Getting ~36-33 tok/s (see the "S_TG t/s" column) on a 24GB Radeon RX 7900 XTX using llama.cpp's Vulkan backend:
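For anyone wanting to reproduce this, a minimal sketch of building llama.cpp with the Vulkan backend (assuming the Vulkan SDK/headers are installed; `GGML_VULKAN` is the current llama.cpp CMake option):

```shell
# Build llama.cpp with the Vulkan backend (requires Vulkan SDK/headers).
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
# The benchmark binary then lives at build/bin/llama-batched-bench
```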
Getting ~44-40 tok/s on 24GB RTX 3090 (llama.cpp version 8884, same llama-batched-bench call):
Edit: Model gets stuck in infinite loops at this quantization level. I've also tried Q5_K_M quantization (fits up to 51968 context length), which seems more robust.
~25-26 tok/s with ROCm using the same card, llama.cpp b8884:
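For comparison, a ROCm build sketch (assuming ROCm is installed; `GGML_HIP` and `AMDGPU_TARGETS` are the current llama.cpp CMake options, and gfx1100 is the RDNA3 target for the RX 7900 XTX):

```shell
# Build llama.cpp with the ROCm (HIP) backend.
# gfx1100 targets RDNA3 (RX 7900 XTX); adjust for your GPU.
cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1100
cmake --build build --config Release -j
```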
128GB (112 GB avail) Ryzen AI Max+ 395 (Strix Halo), Radeon 8060S iGPU (gfx1151)
llama.cpp b8889 with ROCm support; nightly ROCm
llama.cpp/build/bin/llama-batched-bench -hf unsloth/Qwen3.6-27B-GGUF:UD-Q8_K_XL -npp 1000,2000,4000,8000,16000,32000 -ntg 128 -npl 1 -c 34000
More directly comparable to the results posted by genpfault (IQ4_XS):
llama.cpp/build/bin/llama-batched-bench -hf unsloth/Qwen3.6-27B-GGUF:IQ4_XS -npp 1000,2000,4000,8000,16000,32000 -ntg 128 -npl 1 -c 34000
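For anyone unfamiliar with the llama-batched-bench flags: `-npp` is the list of prompt lengths, `-ntg` the number of generated tokens per run, `-npl` the number of parallel sequences, and `-c` the context (KV cache) size. A quick sanity check with the numbers from the command above shows why `-c 34000` covers even the longest case:

```shell
# Worst case: longest prompt + generated tokens, per parallel sequence.
max_npp=32000; ntg=128; npl=1; ctx=34000
worst=$(( (max_npp + ntg) * npl ))
echo "$worst"                                  # 32128
[ "$worst" -le "$ctx" ] && echo "fits in context"
```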
Results are nearly identical running on a Strix Halo using Vulkan, llama.cpp b8884:
You should try Vulkan instead of ROCm; it goes around 20% faster.
Is that based on recent experience? With "stable" ROCm, or the (IMHO better) releases from TheRock? With older or more recent hardware? The AMD landscape is rather uneven.
For this model results are identical. In my experience it can go either way by up to 10%.
At this trajectory, Unsloth are going to release their quants BEFORE the official model drop within the next few weeks...
Haha :)
Do you get early access so you can prep the quants for release?
llama-batched-bench -hf ggml-org/Qwen3.6-27B-GGUF -npp 512,1024,2048,4096,8192,16384,32768 -ntg 128 -npl 1 -c 36000
M2 Ultra, Q8_0
DGX Spark, Q8_0