Comment by bigyabai

3 days ago

It might run the smaller flash version, but 96 GB is not enough for the trillion-parameter model.

The M3 Ultra's GPU is a bit on the weak side for large-scale inference, so you'll be waiting on token prefill for most coding/agent workflows.
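
For a rough sense of why 96 GB falls short, here is a back-of-the-envelope sketch. It assumes exactly 1e12 parameters (taking "trillion-parameter" at face value) and counts the weights only, ignoring KV cache, activations, and OS overhead. The helper name `weight_footprint_gb` is made up for illustration.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 1e12  # assumption: "trillion-parameter" taken literally

for bits in (16, 8, 4, 2, 1):
    gb = weight_footprint_gb(N_PARAMS, bits)
    # Compare against the two memory sizes mentioned in the thread.
    print(
        f"{bits:>2} bits/weight: {gb:7.1f} GB | "
        f"fits in 96 GB: {gb <= 96} | fits in 512 GB: {gb <= 512}"
    )
```

Under these assumptions even 1 bit per weight is about 125 GB, so nothing fits in 96 GB. Real quantization formats also store scales and other metadata, so actual files tend to be somewhat larger than the nominal bit width suggests.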

namegulf 3 days ago

They have a 512 GB RAM option, but it's pricey.

Have you tried any other models with this M3 Ultra?

bigyabai 3 days ago
The 512 GB model would have to use a lobotomized quant like Q2 or Q1, and you would still be waiting 3-5 minutes to process context lengths in the 32,000-64,000 token range.
Apple's GPUs are just not very fast for inference. I'd stick to the smaller 7B-18B parameter range or MoE models like Qwen if you want a usable inference speed.
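
To put the 3-5 minute figure in perspective, this small sketch converts the numbers quoted above into an implied prefill rate. It only divides tokens by seconds, assumes a constant rate over the whole prompt, and is not a benchmark. The helper name `implied_prefill_rate` is made up for illustration.

```python
def implied_prefill_rate(tokens: int, seconds: float) -> float:
    """Prompt-processing throughput in tokens per second."""
    return tokens / seconds


# Figures quoted in the comment: 32,000-64,000 tokens in 3-5 minutes.
for tokens in (32_000, 64_000):
    for minutes in (3, 5):
        rate = implied_prefill_rate(tokens, minutes * 60)
        print(f"{tokens:>6} tokens in {minutes} min: about {rate:.0f} tokens/s")
```

Those combinations work out to roughly 107 to 356 tokens per second of prefill, which is why long-context coding and agent workflows would spend most of their time waiting before the first output token.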
- namegulf 3 days ago
  
  Looks like that's a good idea for now. Yeah, 3-5 minutes is not practical.
  Any thoughts on the M5?
  They may soon release an M5 model for the Mac Studio/Mac mini.
- namegulf 3 days ago
  
  Is the NVIDIA DGX Spark a good option?
  $4,699.00
  But it looks like we may need an NVIDIA AI Enterprise - DGX Spark License.