
Comment by hnfong

20 days ago

I have a maxed out M3 Ultra. It runs quantized large open Chinese models pretty well. It's slow-ish, but since I don't use them very frequently, most of the time is spent waiting for the model to load from disk into RAM.

There are benchmarks on token generation speed out there for some of the large models. You can probably estimate the speed of a model you're interested in by comparing sizes (look mostly at the active params).
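
As a rough sanity check, this is the back-of-envelope arithmetic I mean, assuming generation is memory-bandwidth-bound (the bandwidth figure is the M3 Ultra's spec; the model size and quantization are just illustrative):

    # Ballpark decode speed when generation is memory-bandwidth-bound
    bandwidth_gb_s = 819        # M3 Ultra unified memory bandwidth (spec)
    active_params_b = 37        # e.g. DeepSeek R1 has ~37B active params
    bytes_per_weight = 0.5      # ~4-bit quantization
    gb_read_per_token = active_params_b * bytes_per_weight
    print(bandwidth_gb_s / gb_read_per_token)   # ~44 tok/s theoretical ceiling; real-world is lower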

Currently the main issue for M1-M4 is the prompt processing ("prefill") speed. In practical terms, if you have a very long prompt, it's going to take much longer to process. IIRC it's due to a lack of efficient matrix multiplication support in the hardware, which I hear is rectified in the M5 architecture. So if you need to process long prompts, don't count on the Mac Studio, at least not with large models.
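
To put numbers on it (the prefill rate below is a made-up placeholder, not a benchmark; measure your own setup):

    # How prefill time scales with prompt length
    prompt_tokens = 30_000       # a long prompt
    prefill_tok_per_s = 100      # hypothetical prompt-processing rate
    print(prompt_tokens / prefill_tok_per_s / 60)   # = 5.0 minutes before the first generated token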

So in short: if your prompts are relatively short (e.g. a couple thousand tokens at most), you need or want a large model, you don't need much scale/speed, and you need to run inference locally, then Macs are a reasonable option.

For me personally, I got my M3 Ultra somewhat due to geopolitical issues. I'm barred from accessing some of the SOTA models from the US due to where I live, and sometimes the Chinese models are not conveniently accessible either. With the hardware, they can pry DeepSeek R1, Kimi-K2, etc. from my cold dead hands lol.