Comment by scottjg

12 hours ago

I very recently ran the numbers on these GPUs for an upcoming blog post. The token generation performance is bad, but the prefill performance is _really_ bad.

For a Qwen 3.6 35B / 3B MoE, 4-bit quant:

- processing a 4k-token prompt on an M4 MacBook Air takes 17 seconds before it generates a single token.

- on an M4 Max Mac Studio it's faster, at 2.3 seconds.

- on an RTX 5090, it's 142ms.

An RTX 5090 uses more power than an M4 Max Mac Studio, but it's about 16x faster at prefill (2.3 s vs. 142 ms), and it's not drawing 16x more power.
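
If you want to check the ratios yourself, here's a rough sketch of the arithmetic. It assumes "4k" means 4096 tokens (swap in 4000 if you prefer), and the helper names are just made up for illustration. It only uses the three prefill times listed above, and it doesn't try to model power draw.

```python
# Prefill times from the list above, in seconds, for one 4k-token prompt.
# Assumption: "4k" is taken to mean 4096 tokens.
PROMPT_TOKENS = 4096

prefill_seconds = {
    "M4 MacBook Air": 17.0,
    "M4 Max Mac Studio": 2.3,
    "RTX 5090": 0.142,
}


def prefill_tokens_per_second(seconds, prompt_tokens=PROMPT_TOKENS):
    """Prompt-processing throughput implied by a time-to-first-token figure."""
    return prompt_tokens / seconds


def speedup(slow, fast):
    """How many times faster `fast` finishes prefill compared with `slow`."""
    return prefill_seconds[slow] / prefill_seconds[fast]


if __name__ == "__main__":
    for name, secs in prefill_seconds.items():
        print(f"{name}: {prefill_tokens_per_second(secs):,.0f} tok/s prefill")

    print(f"5090 vs M4 Max: {speedup('M4 Max Mac Studio', 'RTX 5090'):.1f}x")
    print(f"5090 vs M4 Air: {speedup('M4 MacBook Air', 'RTX 5090'):.1f}x")
```

The 5090 vs. M4 Max ratio is where the 16x comes from. To turn that into an energy-per-prompt comparison you'd need measured wattage during prefill, which isn't included here.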