Comment by scottjg
12 hours ago
I very recently ran the numbers on these GPUs for an upcoming blog post. The token generation performance is bad, but the prefill performance is _really_ bad.
For a Qwen 3.6 35B / 3B MoE, 4-bit quant:
- parsing a 4k prompt on an M4 MacBook Air takes 17 seconds before it generates a single token.
- on an M4 Max Mac Studio it's faster, at 2.3 seconds.
- on an RTX 5090, it's 142ms.
An RTX 5090 uses more power than an M4 Max Mac Studio, but not 16x more power, which is roughly the gap in prefill time (2.3 seconds vs. 142ms).
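
For anyone who wants to check the ratios, here is a rough sketch of the arithmetic. It assumes "4k" means 4096 tokens (the comment doesn't say exactly), and the function and variable names are just illustrative:

```python
# Prefill times quoted in the comment, in seconds, for the same prompt.
PROMPT_TOKENS = 4096  # assumption: "4k" is taken to mean 4096 tokens

prefill_seconds = {
    "M4 MacBook Air": 17.0,
    "M4 Max Mac Studio": 2.3,
    "RTX 5090": 0.142,
}


def prefill_rate(seconds, tokens=PROMPT_TOKENS):
    """Prompt-processing throughput in tokens per second."""
    return tokens / seconds


def speedup(slow, fast):
    """How many times faster `fast` finishes prefill than `slow`."""
    return prefill_seconds[slow] / prefill_seconds[fast]


for name, secs in prefill_seconds.items():
    print(f"{name}: {prefill_rate(secs):,.0f} tok/s prefill")

print(f"RTX 5090 vs M4 Max: {speedup('M4 Max Mac Studio', 'RTX 5090'):.1f}x")
print(f"RTX 5090 vs M4 Air: {speedup('M4 MacBook Air', 'RTX 5090'):.1f}x")
```

The speedup ratios don't depend on the exact token count, only the tok/s figures do.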