Comment by benob

10 hours ago

I get ~5 tokens/s on an M4 with 32G of RAM, using:

  llama-server \
   -hf unsloth/Qwen3.6-27B-GGUF:Q4_K_M \
   --no-mmproj \
   --fit on \
   -np 1 \
   -c 65536 \
   --cache-ram 4096 -ctxcp 2 \
   --jinja \
   --temp 0.6 \
   --top-p 0.95 \
   --top-k 20 \
   --min-p 0.0 \
   --presence-penalty 0.0 \
   --repeat-penalty 1.0 \
   --reasoning on \
   --chat-template-kwargs '{"preserve_thinking": true}'

The 35B-A3B model runs at ~25 t/s. For comparison, on an A100 (roughly an RTX 3090 with more memory) they reach 41 t/s and 97 t/s respectively.

I haven't tested the 27B model yet, but 35B-A3B often goes off the rails after 15k-20k tokens of context. You can have it do basic things reliably, but certainly not at the level of "frontier" models.

Why use --fit on on an M4? My understanding was that, given the unified memory, you should push all layers to the GPU with --n-gpu-layers all. Setting --flash-attn on and --no-mmap may also get you better results.
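For what it's worth, a minimal sketch of that suggestion (assuming the flags behave as described above; I haven't benchmarked this exact combination, and the sampling flags from the parent command are omitted for brevity):

```shell
# Sketch: same model, but with all layers offloaded to Metal,
# flash attention on, and mmap disabled, as suggested above.
llama-server \
  -hf unsloth/Qwen3.6-27B-GGUF:Q4_K_M \
  --n-gpu-layers all \
  --flash-attn on \
  --no-mmap \
  -c 65536
```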

I can confirm: with the GGUF version at q4, 35B-A3B basically starts going into thinking loops around 60k.

When you say tok/s here, are you describing the prefill (prompt eval) tok/s or the output generation tok/s?

(Btw, I believe the "--jinja" flag has defaulted to true since sometime in late 2025, so it's no longer needed.)

  • Here is llama-bench on the same M4:

      | model                    |       size |     params | backend    | threads |            test |                  t/s |
      | ------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
      | qwen35 27B Q4_K_M        |  15.65 GiB |    26.90 B | BLAS,MTL   |       4 |           pp512 |         61.31 ± 0.79 |
      | qwen35 27B Q4_K_M        |  15.65 GiB |    26.90 B | BLAS,MTL   |       4 |           tg128 |          5.52 ± 0.08 |
      | qwen35moe 35B.A3B Q3_K_M |  15.45 GiB |    34.66 B | BLAS,MTL   |       4 |           pp512 |        385.54 ± 2.70 |
      | qwen35moe 35B.A3B Q3_K_M |  15.45 GiB |    34.66 B | BLAS,MTL   |       4 |           tg128 |         26.75 ± 0.02 |
    

    So ~60 t/s for prefill and ~5.5 t/s for output on the 27B, and roughly 5-6x that on the 35B-A3B.
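As a quick sanity check on those ratios, the numbers from the table above (just a sketch, not part of the bench output):

```shell
# MoE 35B-A3B speedup over dense 27B, from the llama-bench rows above
awk 'BEGIN {
  printf "prefill speedup: %.1fx\n", 385.54 / 61.31   # pp512
  printf "decode speedup:  %.1fx\n", 26.75 / 5.52     # tg128
}'
```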

  • If someone doesn't specifically say prefill, they always mean decode speed. I have never seen an exception; most people just ignore prefill.

    • But isn't prefill speed the bottleneck in some systems*?

      Sure, it's an order of magnitude faster (10x on Apple Metal?), but there's also an order of magnitude more tokens to process, especially for tasks involving summarization of some sort.

      But point taken that the parent numbers are probably decode.

      * Specifically, Mac Metal, which is what the parent numbers are about

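To put numbers on the prefill-vs-decode question, here's a back-of-the-envelope latency estimate using the 27B rates from the llama-bench comment above (the 20k-token prompt and 500-token answer are hypothetical, meant to stand in for a summarization workload):

```shell
# total latency ≈ prompt_tokens / prefill_rate + output_tokens / decode_rate
awk 'BEGIN {
  pp = 61.31; tg = 5.52         # 27B pp512 / tg128 t/s on the M4
  prompt = 20000; output = 500  # hypothetical summarization workload
  printf "%.0f s prefill + %.0f s decode\n", prompt / pp, output / tg
}'
```

On this workload, prefill dominates the wall-clock time despite being ~10x faster per token, which is the point made above.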
How is the quality of model answers to your queries? Are they stable over time?

I am wondering how to measure that anyway.

Using opencode and Qwen-Coder-Next, I can reliably get up to about 85k tokens of context before it takes too long to respond.

I tried the other Qwen models, and the reasoning stuff seems to do more harm than good.