Getting ~36-33 tok/s (see the "S_TG t/s" column) on a 24GB Radeon RX 7900 XTX using llama.cpp's Vulkan backend:
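For anyone wanting to reproduce this, a minimal sketch of building llama.cpp with the Vulkan backend (assuming the Vulkan SDK/headers are installed; `GGML_VULKAN` is the current llama.cpp CMake option):

```shell
# Build llama.cpp with the Vulkan backend (requires Vulkan SDK/headers).
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
# The benchmark binary then lives at build/bin/llama-batched-bench
```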
Getting ~44-40 tok/s on 24GB RTX 3090 (llama.cpp version 8884, same llama-batched-bench call):
Edit: Model gets stuck in infinite loops at this quantization level. I've also tried Q5_K_M quantization (fits up to 51968 context length), which seems more robust.
~25-26 tok/s with ROCm using the same card, llama.cpp b8884:
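For comparison, a ROCm build sketch (assuming ROCm is installed; `GGML_HIP` and `AMDGPU_TARGETS` are the current llama.cpp CMake options, and gfx1100 is the RDNA3 target for the RX 7900 XTX):

```shell
# Build llama.cpp with the ROCm (HIP) backend.
# gfx1100 targets RDNA3 (RX 7900 XTX); adjust for your GPU.
cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1100
cmake --build build --config Release -j
```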
128GB (112 GB avail) Ryzen AI Max+ 395 (Strix Halo), Radeon 8060S iGPU (gfx1151)
llama.cpp b8889 with ROCm support; nightly ROCm
llama.cpp/build/bin/llama-batched-bench -hf unsloth/Qwen3.6-27B-GGUF:UD-Q8_K_XL -npp 1000,2000,4000,8000,16000,32000 -ntg 128 -npl 1 -c 34000
More directly comparable to the results posted by genpfault (IQ4_XS):
llama.cpp/build/bin/llama-batched-bench -hf unsloth/Qwen3.6-27B-GGUF:IQ4_XS -npp 1000,2000,4000,8000,16000,32000 -ntg 128 -npl 1 -c 34000
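For anyone unfamiliar with the llama-batched-bench flags: `-npp` is the list of prompt lengths, `-ntg` the number of generated tokens per run, `-npl` the number of parallel sequences, and `-c` the context (KV cache) size. A quick sanity check with the numbers from the command above shows why `-c 34000` covers even the longest case:

```shell
# Worst case: longest prompt + generated tokens, per parallel sequence.
max_npp=32000; ntg=128; npl=1; ctx=34000
worst=$(( (max_npp + ntg) * npl ))
echo "$worst"                                  # 32128
[ "$worst" -le "$ctx" ] && echo "fits in context"
```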
Results are nearly identical running on a Strix Halo using Vulkan, llama.cpp b8884:
You should try Vulkan instead of ROCm; it goes around 20% faster.
Is that based on recent experience? With "stable" ROCm, or the (IMHO better) releases from TheRock? With older or more recent hardware? The AMD landscape is rather uneven.
For this model results are identical. In my experience it can go either way by up to 10%.
At this trajectory, Unsloth are going to release their quants BEFORE the official model drop within the next few weeks...
Haha :)
Do you get early access so you can prep the quants for release?
llama-batched-bench -hf ggml-org/Qwen3.6-27B-GGUF -npp 512,1024,2048,4096,8192,16384,32768 -ntg 128 -npl 1 -c 36000
M2 Ultra, Q8_0
DGX Spark, Q8_0