Comment by busfahrer

1 day ago

I have been contemplating an M5 Pro MBP, but for the life of me I wasn't able to find benchmarks for real-world models. Do you happen to know roughly how many tokens per second you get with MoE models like Qwen 3.6 35B/A3B or Gemma 4 26B?

11 comments

ahknight 1 day ago

I'm not normally one to share videos as answers, but this particular fellow does a LOT of work with local AIs and Macs and happens to have a nuanced answer. https://youtu.be/XGe7ldwFLSE

embedding-shape 1 day ago

You need to ask macOS people for their prefill speed as well. There are two numbers you care about here (prefill and token generation), and current MacBooks generally have terrible prefill performance. Surely it'll get better with time, but if you already have a desktop, I'd go the "beefy GPU" route first.

egorfine 1 day ago

Qwen 3.6 35B on oMLX 0.3.9rc1: I get 86 t/s on Q4 and 74 t/s on Q6.

Bear in mind that TTFT on MLX is much, much faster on the M5 Pro than on the M4 Pro.

Also bear in mind that those figures are with NO optimizations whatsoever: no MCP, no DFlash. I am waiting for both to be released for the Qwen models.

busfahrer 1 day ago
Great, thanks! :-) And to mirror another poster: what kind of prompt processing (prefill) speed do you get for that model? Also, how is the speed for the 27B model?
- egorfine 1 day ago
  
  35B: 1300-1800 t/s on both Q4 and Q6.
  27B: give me 20 minutes
  

egorfine 1 day ago

Qwen3.6 27B oQ6: 12.5 t/s generation, 340-360 t/s pp (prompt processing).
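
To see why both the prefill and generation numbers matter, here is a rough back-of-the-envelope sketch. It assumes total response time is simply prompt tokens divided by prefill speed plus output tokens divided by generation speed, ignoring model load time, caching, and any slowdown at long context. The function name `response_time_s` and the 8,000-token prompt / 500-token answer workload are made up for illustration; the speeds are the ones quoted in this thread, using the low end of the 35B prefill range.

```python
def response_time_s(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Estimate wall-clock seconds: prefill phase plus generation phase."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


# Hypothetical workload (an assumption, not from the thread).
PROMPT_TOKENS, OUTPUT_TOKENS = 8000, 500

# Speeds quoted in this thread for the M5 Pro.
moe_35b_q4 = response_time_s(PROMPT_TOKENS, OUTPUT_TOKENS, prefill_tps=1300, decode_tps=86)
dense_27b = response_time_s(PROMPT_TOKENS, OUTPUT_TOKENS, prefill_tps=340, decode_tps=12.5)

print(f"35B Q4:  {moe_35b_q4:.1f} s")   # about 12.0 s
print(f"27B oQ6: {dense_27b:.1f} s")    # about 63.5 s
```

With a long prompt, prefill accounts for roughly half of the 35B estimate, which is why a generation-only t/s figure can be misleading.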

egorfine 1 day ago

Native MCP:

For Qwen 35B enabling native MCP on MLX models slows it down by 10%.

For Qwen 27B enabling native MCP on MLX models speeds token generation up almost exactly 1.5x.

(All tested on M5 Pro.)
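
For a sense of scale, the relative changes above can be applied to the generation speeds quoted earlier in the thread. This is only a sketch: it assumes the native-MCP comparison started from those same baseline runs, which the comments do not actually state, and the dictionary names are illustrative.

```python
# Baseline generation speeds quoted earlier in the thread (tokens/s).
# Assumption: the native-MCP comparison used these same runs.
baselines = {"35B Q4": 86.0, "35B Q6": 74.0, "27B oQ6": 12.5}

# Multipliers from the comment: 10% slower for 35B, about 1.5x faster for 27B.
factors = {"35B Q4": 0.90, "35B Q6": 0.90, "27B oQ6": 1.5}

projected = {name: round(tps * factors[name], 2) for name, tps in baselines.items()}

for name, tps in projected.items():
    print(f"{name}: {baselines[name]} -> {tps} t/s")
```

Under that assumption the 35B would land around 77 t/s (Q4) or 67 t/s (Q6), and the 27B just under 19 t/s.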

mlvljr 1 day ago

[dead]

juancn 1 day ago

I'm running unsloth/Qwen3.6-35B-A3B-UD-Q8_K_XL on an M3 Max, 64GB at ~57 t/s with llama-server

brcmthrowaway 1 day ago

Prefill speed and 27B number?