Comment by benob
7 hours ago
Here is llama-bench on the same M4:
| model | size | params | backend | threads | test | t/s |
| ------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen35 27B Q4_K_M | 15.65 GiB | 26.90 B | BLAS,MTL | 4 | pp512 | 61.31 ± 0.79 |
| qwen35 27B Q4_K_M | 15.65 GiB | 26.90 B | BLAS,MTL | 4 | tg128 | 5.52 ± 0.08 |
| qwen35moe 35B.A3B Q3_K_M | 15.45 GiB | 34.66 B | BLAS,MTL | 4 | pp512 | 385.54 ± 2.70 |
| qwen35moe 35B.A3B Q3_K_M | 15.45 GiB | 34.66 B | BLAS,MTL | 4 | tg128 | 26.75 ± 0.02 |
So roughly 61 t/s prompt processing and 5.5 t/s token generation on the dense 27B, and about 5x faster generation on the 35B-A3B MoE.
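For reference, the speedups implied by the t/s column work out like this (a quick arithmetic check, not part of the original benchmark output):

```python
# Speedup of the 35B-A3B MoE over the dense 27B,
# from the t/s values in the llama-bench table above.
dense = {"pp512": 61.31, "tg128": 5.52}
moe = {"pp512": 385.54, "tg128": 26.75}

speedup = {test: moe[test] / dense[test] for test in dense}
print(speedup)  # prefill ~6.3x, generation ~4.8x
```

So the "~5x" holds for token generation, while prompt processing gains a bit more.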