Comment by pixelesque

1 hour ago

Out of interest, what machine and model are you running it on?

I tried the qwen3.6-27b Q6_K GGUF in llama.cpp and LM Studio on my M2 MacBook Pro 32GB machine last week, and I barely get a token a second with either.

What sort of speed should I be expecting?

I tried some of the Llama 3 34b (nous-capybara?) models two years ago with llama.cpp, and I seem to remember getting a few tokens a second then, so I'm not sure whether I've misconfigured something or just have unreasonable expectations.

Or maybe qwen 3.x is slower for some reason? (Is it mixture of experts?)

I'm not expecting it to be instant, but what I'm currently seeing is not really usable.

9 comments

gcr 1 hour ago

There are two flavors of Qwen 3.6:

- A 27B "dense" model

- A 35B "Mixture of Experts" model, which activates only 3B parameters for each token.

For your hardware, I strongly recommend `unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_M`. I have an M1 Max from 2021 with 32GB of unified memory that can read at ~300-500 tokens/sec and write at ~30 tokens/sec with llama.cpp's default settings, which is plenty fast. The 27B model can read at ~70 tokens/sec and write at ~5 tokens/sec.

The 35B MoE model technically takes slightly more memory but is much faster because it's doing roughly 1/9th the work per token (3B active parameters versus 27B). It's not quite as "smart", but it's comparable.
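To make the arithmetic above concrete, here is a minimal back-of-the-envelope sketch. The parameter counts come from the model names; the bits-per-weight figure is an assumed average for a 4-bit K-quant, not a measured value, and the memory estimate ignores the KV cache and runtime overhead.

```python
# Rough comparison of a dense 27B model and a 35B MoE with ~3B active parameters.
DENSE_PARAMS_B = 27.0       # dense: every parameter is used for every token
MOE_TOTAL_PARAMS_B = 35.0   # MoE: all experts must still be held in memory
MOE_ACTIVE_PARAMS_B = 3.0   # MoE: parameters actually used per token

# ASSUMPTION: approximate average bits per weight for a 4-bit K-quant.
ASSUMED_BITS_PER_WEIGHT = 4.85


def work_ratio(active_params_b: float, dense_params_b: float) -> float:
    """Fraction of per-token work relative to the dense model."""
    return active_params_b / dense_params_b


def weight_memory_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (billions of params * bits / 8)."""
    return params_b * bits_per_weight / 8.0


if __name__ == "__main__":
    ratio = work_ratio(MOE_ACTIVE_PARAMS_B, DENSE_PARAMS_B)
    print(f"MoE per-token work vs dense: about 1/{1 / ratio:.0f}")
    for name, params in (("27B dense", DENSE_PARAMS_B), ("35B MoE", MOE_TOTAL_PARAMS_B)):
        size = weight_memory_gb(params, ASSUMED_BITS_PER_WEIGHT)
        print(f"{name}: roughly {size:.1f} GB of weights")
```

At the same quantization the MoE weights come out larger than the dense ones, which matches the "slightly more memory, much less work" trade-off described above.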

pixelesque 11 minutes ago

Thank you - I'll give that a go!

julianlam 28 minutes ago

May I ask why the M instead of XL?
Obviously bigger != better but I don't know what the differences are.

mft_ 1 hour ago

The 27B model is dense, so it is relatively slow. The 35B-A3B model is marginally weaker, but being MoE it is much faster: about 4-8x faster in basic benchmarks on my M1 Max.

For comparison, I just ran a couple of quick benchmarks (default settings) with llama-bench:

Qwen3.6-35B-A3B at Q6_K_XL gave 858 t/s pp512 (prompt processing) and 43 t/s tg128 (token generation).

Qwen3.6-27B at Q4_K_XL gave 103 t/s pp512 and 8 t/s tg128.
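As a quick sanity check on those figures, the sketch below computes the speedup ratios and a rough end-to-end time for one request. The numbers are the ones quoted above; the 2000-token prompt and 500-token reply are made-up illustrative values, and the helper names are hypothetical. Note the two runs use different quantizations, so this is not a like-for-like comparison.

```python
# llama-bench figures quoted above, in tokens per second.
RESULTS = {
    "Qwen3.6-35B-A3B Q6_K_XL": {"pp512": 858.0, "tg128": 43.0},
    "Qwen3.6-27B Q4_K_XL": {"pp512": 103.0, "tg128": 8.0},
}


def speedup(fast: dict, slow: dict) -> dict:
    """Per-metric ratio of the faster model's throughput to the slower one's."""
    return {metric: fast[metric] / slow[metric] for metric in fast}


def seconds_for(prompt_tokens: int, output_tokens: int, rates: dict) -> float:
    """Rough wall time: prompt processing plus token generation."""
    return prompt_tokens / rates["pp512"] + output_tokens / rates["tg128"]


if __name__ == "__main__":
    moe = RESULTS["Qwen3.6-35B-A3B Q6_K_XL"]
    dense = RESULTS["Qwen3.6-27B Q4_K_XL"]
    for metric, ratio in speedup(moe, dense).items():
        print(f"{metric}: MoE is {ratio:.1f}x faster")
    # Illustrative request size (assumption): 2000 prompt tokens, 500 output tokens.
    for name, rates in RESULTS.items():
        print(f"{name}: ~{seconds_for(2000, 500, rates):.0f} s")
```

On these numbers prompt processing is a bit over 8x faster and generation a bit over 5x faster for the MoE model.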

pixelesque 11 minutes ago

Thanks for the info.

Figs 1 hour ago

27B is the dense one. Try the Qwen3.6-35B-A3B variants for the MoE release. That's what I'm running on a Framework Desktop and I get ~50 tok/s plus or minus a few. The dense one is similarly slow for me -- not sure what to expect on your hardware from the MoE but it should probably be much faster.

pixelesque 11 minutes ago

Thanks!

KronisLV 1 hour ago

> qwen3.6-27b Q6_k

That's the dense model, you probably want a mixture-of-experts (MoE) one.

Here's what you probably want instead: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF

pixelesque 11 minutes ago

Thanks!