Comment by simonw

6 hours ago

Models of this size can usually be run using MLX on a pair of 512GB Mac Studio M3 Ultras, which are about $10,000 each so $20,000 for the pair.
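
For concreteness, a minimal single-machine sketch using the mlx-lm Python package (the model id below is a placeholder, and spanning two Studios would additionally need MLX's distributed support):

    # Minimal mlx-lm sketch; assumes `pip install mlx-lm` on Apple silicon.
    # The repo id is a placeholder, not a specific model from this thread.
    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/placeholder-model-4bit")

    # verbose=True also prints tokens-per-second stats for the run.
    text = generate(model, tokenizer, prompt="1+1=", max_tokens=32, verbose=True)
    print(text)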

You might want to clarify that this is more of a "look, it technically works" than an "I actually use this."

The difference between waiting 20 minutes for an answer to the prompt '1+1=' and actually using it for something useful is massive here. I wonder where this idea of running AI on CPU comes from. Was it Apple astroturfing? Apple fanboys? I don't see people wasting time on non-Apple CPUs. (Although I did do this once with a 7B model.)

  • The reason Macs get recommended is the unified memory, which is usable as VRAM for the GPU. People are similarly using the AMD Strix Halo for AI, which has a similar memory architecture. Time to first token for something like '1+1=' would be seconds, and then you'd be getting ~20 tokens per second (see the back-of-envelope sketch after this thread), which is plenty fast for regular use. Tokens per second drop at higher context lengths, but it's still practical for a lot of use cases. Though I agree that agentic coding, especially over large projects, would likely get too slow to be practical.

    • Not too slow if you just let it run overnight/in the background. But the biggest draw would be no rate limits whatsoever compared to the big proprietary APIs, especially Claude's.

  • The Mac Studio route is not "AI on CPU": the M2/M4 chips are complex SoCs that include a GPU with unified memory access.
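
A back-of-envelope sketch of the ~20 tokens/second figure above, assuming memory-bandwidth-bound decoding; the bandwidth and active-parameter numbers are illustrative assumptions, not figures from the thread:

    # Decode speed is roughly memory bandwidth / bytes of weights read per token.
    bandwidth_gb_s = 819      # assumed M3 Ultra peak memory bandwidth (GB/s)
    active_params = 37e9      # assumed active parameters per token (MoE model)
    bytes_per_param = 0.5     # 4-bit quantization

    gb_per_token = active_params * bytes_per_param / 1e9   # ~18.5 GB per token
    ceiling_tok_s = bandwidth_gb_s / gb_per_token          # ~44 tok/s theoretical
    print(f"theoretical ceiling: {ceiling_tok_s:.0f} tok/s")
    # Real-world throughput typically lands well under the ceiling, ~20 tok/s here.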