Comment by spwa4

8 hours ago

Unsloth quants available:

https://unsloth.ai/docs/models/qwen3.6
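
A hedged sketch for trying one of these quants quickly: llama.cpp tools can pull a GGUF straight from Hugging Face via -hf (the quant tag and context size here are simply the ones used in the benchmarks below):

    # downloads and caches the GGUF on first run, then serves it
    $ llama-server -hf unsloth/Qwen3.6-27B-GGUF:IQ4_XS -c 34000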

Getting ~36-33 tok/s (see the "S_TG t/s" column) on a 24GB Radeon RX 7900 XTX using llama.cpp's Vulkan backend:

    $ llama-server --version
    version: 8851 (e365e658f)

    $ llama-batched-bench -hf unsloth/Qwen3.6-27B-GGUF:IQ4_XS -npp 1000,2000,4000,8000,16000,32000 -ntg 128 -npl 1 -c 34000
    |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
    |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
    |  1000 |    128 |    1 |   1128 |    1.529 |   654.11 |    3.470 |    36.89 |    4.999 |   225.67 |
    |  2000 |    128 |    1 |   2128 |    3.064 |   652.75 |    3.498 |    36.59 |    6.562 |   324.30 |
    |  4000 |    128 |    1 |   4128 |    6.180 |   647.29 |    3.535 |    36.21 |    9.715 |   424.92 |
    |  8000 |    128 |    1 |   8128 |   12.477 |   641.16 |    3.582 |    35.73 |   16.059 |   506.12 |
    | 16000 |    128 |    1 |  16128 |   25.849 |   618.98 |    3.667 |    34.91 |   29.516 |   546.42 |
    | 32000 |    128 |    1 |  32128 |   57.201 |   559.43 |    3.825 |    33.47 |   61.026 |   526.47 |

  • Getting ~44-40 tok/s on a 24GB RTX 3090 (llama.cpp b8884, same llama-batched-bench call):

        |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
        |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
        |  1000 |    128 |    1 |   1128 |    0.684 |  1462.61 |    2.869 |    44.61 |    3.553 |   317.47 |
        |  2000 |    128 |    1 |   2128 |    1.390 |  1438.84 |    2.868 |    44.64 |    4.258 |   499.80 |
        |  4000 |    128 |    1 |   4128 |    2.791 |  1433.18 |    2.886 |    44.35 |    5.677 |   727.11 |
        |  8000 |    128 |    1 |   8128 |    5.646 |  1416.98 |    2.922 |    43.80 |    8.568 |   948.65 |
        | 16000 |    128 |    1 |  16128 |   11.851 |  1350.10 |    3.007 |    42.57 |   14.857 |  1085.51 |
        | 32000 |    128 |    1 |  32128 |   25.855 |  1237.66 |    3.168 |    40.40 |   29.024 |  1106.96 |
    

    Edit: the model gets stuck in infinite loops at this quantization level. I've also tried the Q5_K_M quant (fits up to a 51968-token context), which seems more robust.
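
    A minimal sketch of the more robust setup, assuming the repo exposes a plain Q5_K_M tag (Unsloth's usual naming); --repeat-penalty is a standard llama.cpp sampling flag I'd reach for against looping, not something model-specific:

        # Q5_K_M fits up to ~52k context on this 24GB card
        $ llama-server -hf unsloth/Qwen3.6-27B-GGUF:Q5_K_M -c 51968 --repeat-penalty 1.1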

  • Getting ~26-25 tok/s with ROCm on the same card, llama.cpp b8884:

        $ llama-batched-bench -dev ROCm1 -hf unsloth/Qwen3.6-27B-GGUF:IQ4_XS -npp 1000,2000,4000,8000,16000,32000 -ntg 128 -npl 1 -c 34000
        |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
        |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
        |  1000 |    128 |    1 |   1128 |    1.034 |   966.90 |    4.851 |    26.39 |    5.885 |   191.67 |
        |  2000 |    128 |    1 |   2128 |    2.104 |   950.38 |    4.853 |    26.38 |    6.957 |   305.86 |
        |  4000 |    128 |    1 |   4128 |    4.269 |   937.00 |    4.876 |    26.25 |    9.145 |   451.40 |
        |  8000 |    128 |    1 |   8128 |    8.962 |   892.69 |    4.912 |    26.06 |   13.873 |   585.88 |
        | 16000 |    128 |    1 |  16128 |   19.673 |   813.31 |    4.996 |    25.62 |   24.669 |   653.78 |
        | 32000 |    128 |    1 |  32128 |   46.304 |   691.09 |    5.122 |    24.99 |   51.426 |   624.75 |
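
    The device names passed to -dev (ROCm1 here) come from llama.cpp's device enumeration; recent builds can, as far as I know, print them directly, though the exact output format varies by build:

        # lists every device the binary was compiled with (Vulkan, ROCm, ...);
        # pass one of the printed names to -dev
        $ llama-batched-bench --list-devices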

Strix Halo (Ryzen AI Max+ 395) with a Radeon 8060S iGPU (gfx1151), 128 GB unified memory (112 GB available to the GPU)

llama.cpp build 8889 with ROCm support; nightly ROCm

    $ llama.cpp/build/bin/llama-batched-bench -hf unsloth/Qwen3.6-27B-GGUF:UD-Q8_K_XL -npp 1000,2000,4000,8000,16000,32000 -ntg 128 -npl 1 -c 34000

    |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
    |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
    |  1000 |    128 |    1 |   1128 |    2.776 |   360.22 |   20.192 |     6.34 |   22.968 |    49.11 |
    |  2000 |    128 |    1 |   2128 |    5.778 |   346.12 |   20.211 |     6.33 |   25.990 |    81.88 |
    |  4000 |    128 |    1 |   4128 |   11.723 |   341.22 |   20.291 |     6.31 |   32.013 |   128.95 |
    |  8000 |    128 |    1 |   8128 |   24.223 |   330.26 |   20.399 |     6.27 |   44.622 |   182.15 |
    | 16000 |    128 |    1 |  16128 |   52.521 |   304.64 |   20.669 |     6.19 |   73.190 |   220.36 |
    | 32000 |    128 |    1 |  32128 |  120.333 |   265.93 |   21.244 |     6.03 |  141.577 |   226.93 |
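
Before benchmarking I'd sanity-check that the nightly ROCm actually sees the gfx1151 iGPU; rocminfo ships with ROCm, and the grep is just a convenience:

    # expect a gfx1151 agent for the Radeon 8060S among the listed agents;
    # if it's absent, the ROCm backend has no device to use
    $ rocminfo | grep -i gfx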

More directly comparable to the results posted by genpfault (IQ4_XS):

    $ llama.cpp/build/bin/llama-batched-bench -hf unsloth/Qwen3.6-27B-GGUF:IQ4_XS -npp 1000,2000,4000,8000,16000,32000 -ntg 128 -npl 1 -c 34000

    |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
    |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
    |  1000 |    128 |    1 |   1128 |    2.543 |   393.23 |    9.829 |    13.02 |   12.372 |    91.17 |
    |  2000 |    128 |    1 |   2128 |    5.400 |   370.36 |    9.891 |    12.94 |   15.291 |   139.17 |
    |  4000 |    128 |    1 |   4128 |   10.950 |   365.30 |    9.972 |    12.84 |   20.922 |   197.31 |
    |  8000 |    128 |    1 |   8128 |   22.762 |   351.46 |   10.118 |    12.65 |   32.880 |   247.20 |
    | 16000 |    128 |    1 |  16128 |   49.386 |   323.98 |   10.387 |    12.32 |   59.773 |   269.82 |
    | 32000 |    128 |    1 |  32128 |  114.218 |   280.16 |   10.950 |    11.69 |  125.169 |   256.68 |

  • Results are nearly identical running on a Strix Halo using Vulkan, llama.cpp b8884:

        $ llama-batched-bench -dev Vulkan2 -hf unsloth/Qwen3.6-27B-GGUF:IQ4_XS -npp 1000,2000,4000,8000,16000,32000 -ntg 128 -npl 1 -c 34000
        |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
        |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
        |  1000 |    128 |    1 |   1128 |    3.288 |   304.15 |    9.873 |    12.96 |   13.161 |    85.71 |
        |  2000 |    128 |    1 |   2128 |    6.415 |   311.79 |    9.883 |    12.95 |   16.297 |   130.57 |
        |  4000 |    128 |    1 |   4128 |   13.113 |   305.04 |    9.979 |    12.83 |   23.092 |   178.76 |
        |  8000 |    128 |    1 |   8128 |   27.491 |   291.01 |   10.155 |    12.61 |   37.645 |   215.91 |
        | 16000 |    128 |    1 |  16128 |   59.079 |   270.83 |   10.476 |    12.22 |   69.555 |   231.87 |
        | 32000 |    128 |    1 |  32128 |  148.625 |   215.31 |   11.084 |    11.55 |  159.709 |   201.17 |

  • You should try Vulkan instead of ROCm; it's about 20% faster.

    • Is that based on recent experience? With "stable" ROCm, or the (IMHO better) releases from TheRock? With older or more recent hardware? The AMD landscape is rather uneven.
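
      One way to settle it on a given machine is a single binary with both backends compiled in, switched per run with -dev. A sketch using llama.cpp's standard CMake switches (GGML_VULKAN / GGML_HIP); the HIP side additionally assumes a working ROCm toolchain, and I haven't verified that every backend combination coexists cleanly:

          $ cmake -B build -DGGML_VULKAN=ON -DGGML_HIP=ON
          $ cmake --build build --config Release -j
          # same model and flags, only the device changes between runs:
          $ build/bin/llama-batched-bench -dev Vulkan0 -hf unsloth/Qwen3.6-27B-GGUF:IQ4_XS -npp 8000 -ntg 128 -npl 1
          $ build/bin/llama-batched-bench -dev ROCm0 -hf unsloth/Qwen3.6-27B-GGUF:IQ4_XS -npp 8000 -ntg 128 -npl 1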

    $ llama-batched-bench -hf ggml-org/Qwen3.6-27B-GGUF -npp 512,1024,2048,4096,8192,16384,32768 -ntg 128 -npl 1 -c 36000

M2 Ultra, Q8_0

    |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
    |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
    |   512 |    128 |    1 |    640 |    1.307 |   391.69 |    6.209 |    20.61 |    7.516 |    85.15 |
    |  1024 |    128 |    1 |   1152 |    2.534 |   404.16 |    6.227 |    20.56 |    8.760 |   131.50 |
    |  2048 |    128 |    1 |   2176 |    5.029 |   407.26 |    6.229 |    20.55 |   11.258 |   193.29 |
    |  4096 |    128 |    1 |   4224 |   10.176 |   402.52 |    6.278 |    20.39 |   16.454 |   256.72 |
    |  8192 |    128 |    1 |   8320 |   20.784 |   394.14 |    6.376 |    20.08 |   27.160 |   306.33 |
    | 16384 |    128 |    1 |  16512 |   43.513 |   376.53 |    6.532 |    19.59 |   50.046 |   329.94 |
    | 32768 |    128 |    1 |  32896 |   99.137 |   330.53 |    7.081 |    18.08 |  106.218 |   309.70 |

DGX Spark, Q8_0

    |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
    |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
    |   512 |    128 |    1 |    640 |    0.881 |   580.98 |   16.122 |     7.94 |   17.003 |    37.64 |
    |  1024 |    128 |    1 |   1152 |    1.749 |   585.43 |   16.131 |     7.93 |   17.880 |    64.43 |
    |  2048 |    128 |    1 |   2176 |    3.486 |   587.54 |   16.169 |     7.92 |   19.655 |   110.71 |
    |  4096 |    128 |    1 |   4224 |    7.018 |   583.64 |   16.245 |     7.88 |   23.263 |   181.58 |
    |  8192 |    128 |    1 |   8320 |   14.189 |   577.33 |   16.427 |     7.79 |   30.617 |   271.75 |
    | 16384 |    128 |    1 |  16512 |   29.015 |   564.68 |   16.749 |     7.64 |   45.763 |   360.81 |
    | 32768 |    128 |    1 |  32896 |   60.413 |   542.40 |   17.359 |     7.37 |   77.772 |   422.98 |