Comment by genpfault

11 hours ago

Getting ~36-33 tok/s (see the "S_TG t/s" column) on a 24GB Radeon RX 7900 XTX using llama.cpp's Vulkan backend:

    $ llama-server --version
    version: 8851 (e365e658f)

    $ llama-batched-bench -hf unsloth/Qwen3.6-27B-GGUF:IQ4_XS -npp 1000,2000,4000,8000,16000,32000 -ntg 128 -npl 1 -c 34000
    |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
    |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
    |  1000 |    128 |    1 |   1128 |    1.529 |   654.11 |    3.470 |    36.89 |    4.999 |   225.67 |
    |  2000 |    128 |    1 |   2128 |    3.064 |   652.75 |    3.498 |    36.59 |    6.562 |   324.30 |
    |  4000 |    128 |    1 |   4128 |    6.180 |   647.29 |    3.535 |    36.21 |    9.715 |   424.92 |
    |  8000 |    128 |    1 |   8128 |   12.477 |   641.16 |    3.582 |    35.73 |   16.059 |   506.12 |
    | 16000 |    128 |    1 |  16128 |   25.849 |   618.98 |    3.667 |    34.91 |   29.516 |   546.42 |
    | 32000 |    128 |    1 |  32128 |   57.201 |   559.43 |    3.825 |    33.47 |   61.026 |   526.47 |
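For anyone unfamiliar with llama-batched-bench output: the throughput columns are just the token counts divided by the timings, i.e. S_PP = PP / T_PP, S_TG = TG / T_TG, and S = (PP + TG) / T. A quick sanity check against the first row above (values copied from the table; the last-digit mismatches come from the timings being rounded to milliseconds):

```python
# Re-derive the throughput columns from the first table row above
# (PP=1000, TG=128, T_PP=1.529 s, T_TG=3.470 s).
pp, tg = 1000, 128
t_pp, t_tg = 1.529, 3.470

s_pp = pp / t_pp               # prompt-processing speed, tok/s
s_tg = tg / t_tg               # generation speed, tok/s
t_total = t_pp + t_tg          # total wall time, s
s_total = (pp + tg) / t_total  # overall throughput, tok/s

# Matches the table row within rounding (654.11 / 36.89 / 4.999 / 225.67)
print(f"S_PP={s_pp:.2f}  S_TG={s_tg:.2f}  T={t_total:.3f}  S={s_total:.2f}")
```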

Getting ~44-40 tok/s on a 24GB RTX 3090 (llama.cpp version 8884, same llama-batched-bench call):

    |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
    |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
    |  1000 |    128 |    1 |   1128 |    0.684 |  1462.61 |    2.869 |    44.61 |    3.553 |   317.47 |
    |  2000 |    128 |    1 |   2128 |    1.390 |  1438.84 |    2.868 |    44.64 |    4.258 |   499.80 |
    |  4000 |    128 |    1 |   4128 |    2.791 |  1433.18 |    2.886 |    44.35 |    5.677 |   727.11 |
    |  8000 |    128 |    1 |   8128 |    5.646 |  1416.98 |    2.922 |    43.80 |    8.568 |   948.65 |
    | 16000 |    128 |    1 |  16128 |   11.851 |  1350.10 |    3.007 |    42.57 |   14.857 |  1085.51 |
    | 32000 |    128 |    1 |  32128 |   25.855 |  1237.66 |    3.168 |    40.40 |   29.024 |  1106.96 |

Edit: the model gets stuck in infinite generation loops at this quantization level. I've also tried the Q5_K_M quantization (fits up to a 51968-token context), which seems more robust.

Getting ~26-25 tok/s on the same 7900 XTX using llama.cpp's ROCm backend, llama.cpp b8884:

    $ llama-batched-bench -dev ROCm1 -hf unsloth/Qwen3.6-27B-GGUF:IQ4_XS -npp 1000,2000,4000,8000,16000,32000 -ntg 128 -npl 1 -c 34000
    |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
    |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
    |  1000 |    128 |    1 |   1128 |    1.034 |   966.90 |    4.851 |    26.39 |    5.885 |   191.67 |
    |  2000 |    128 |    1 |   2128 |    2.104 |   950.38 |    4.853 |    26.38 |    6.957 |   305.86 |
    |  4000 |    128 |    1 |   4128 |    4.269 |   937.00 |    4.876 |    26.25 |    9.145 |   451.40 |
    |  8000 |    128 |    1 |   8128 |    8.962 |   892.69 |    4.912 |    26.06 |   13.873 |   585.88 |
    | 16000 |    128 |    1 |  16128 |   19.673 |   813.31 |    4.996 |    25.62 |   24.669 |   653.78 |
    | 32000 |    128 |    1 |  32128 |   46.304 |   691.09 |    5.122 |    24.99 |   51.426 |   624.75 |
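Pulling the S_TG endpoints out of the three tables: generation speed only degrades modestly between a 1k and a 32k prompt on all three setups (a quick summary script; numbers copied from the tables above, labels are mine):

```python
# Generation throughput (S_TG t/s) at the shortest (1k) and longest (32k)
# prompt lengths, copied from the three benchmark tables above.
runs = {
    "RX 7900 XTX / Vulkan": (36.89, 33.47),
    "RTX 3090":             (44.61, 40.40),
    "RX 7900 XTX / ROCm":   (26.39, 24.99),
}

for name, (tg_1k, tg_32k) in runs.items():
    # Relative slowdown going from a 1k-token to a 32k-token prompt.
    drop = (1 - tg_32k / tg_1k) * 100
    print(f"{name}: {tg_1k:.2f} -> {tg_32k:.2f} tok/s ({drop:.1f}% drop)")
```

Interestingly, the ROCm run is the slowest in absolute terms but degrades the least (~5% vs. ~9% for the other two).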