Comment by cpburns2009
5 hours ago
~25-26 tok/s generation with ROCm on the same card, llama.cpp b8884:
$ llama-batched-bench -dev ROCm1 -hf unsloth/Qwen3.6-27B-GGUF:IQ4_XS -npp 1000,2000,4000,8000,16000,32000 -ntg 128 -npl 1 -c 34000
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
| 1000 | 128 | 1 | 1128 | 1.034 | 966.90 | 4.851 | 26.39 | 5.885 | 191.67 |
| 2000 | 128 | 1 | 2128 | 2.104 | 950.38 | 4.853 | 26.38 | 6.957 | 305.86 |
| 4000 | 128 | 1 | 4128 | 4.269 | 937.00 | 4.876 | 26.25 | 9.145 | 451.40 |
| 8000 | 128 | 1 | 8128 | 8.962 | 892.69 | 4.912 | 26.06 | 13.873 | 585.88 |
| 16000 | 128 | 1 | 16128 | 19.673 | 813.31 | 4.996 | 25.62 | 24.669 | 653.78 |
| 32000 | 128 | 1 | 32128 | 46.304 | 691.09 | 5.122 | 24.99 | 51.426 | 624.75 |
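For anyone reading the columns: the rates look like plain tokens divided by seconds. Here is a minimal sketch that recomputes them from the timings in the table, assuming batch size 1 and that S_PP = PP / T_PP, S_TG = TG / T_TG and S = N_KV / T. That relation is inferred from the numbers above, not taken from the llama.cpp source, and the `throughput` helper is a hypothetical name used only for illustration.

```python
# Rows copied from the table above: (PP, TG, T_PP seconds, T_TG seconds).
rows = [
    (1000, 128, 1.034, 4.851),
    (2000, 128, 2.104, 4.853),
    (4000, 128, 4.269, 4.876),
    (8000, 128, 8.962, 4.912),
    (16000, 128, 19.673, 4.996),
    (32000, 128, 46.304, 5.122),
]


def throughput(pp, tg, t_pp, t_tg):
    """Return (prompt t/s, generation t/s, combined t/s) for one row.

    Assumes batch size 1, so the combined rate is simply
    (PP + TG) / (T_PP + T_TG).
    """
    return pp / t_pp, tg / t_tg, (pp + tg) / (t_pp + t_tg)


if __name__ == "__main__":
    for pp, tg, t_pp, t_tg in rows:
        s_pp, s_tg, s = throughput(pp, tg, t_pp, t_tg)
        print(f"PP={pp:>5}  S_PP={s_pp:7.1f}  S_TG={s_tg:5.2f}  S={s:6.1f}")
```

The recomputed values match the reported ones up to rounding of the printed timings. Going from 1K to 32K of context, generation speed drops by roughly 5% (26.39 to 24.99 t/s), while prompt processing drops by roughly 29% (966.90 to 691.09 t/s).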