Comment by simonw
15 hours ago
I got this running on a 128GB M5 the other day - pretty painless, model runs in about 80GB of RAM and it seemed to be very capable at writing code and tool execution.
How’s the token throughput / response time?
Healthy!
From https://gist.github.com/simonw/31127f9025845c4c9b10c3e0d8612...
Prefill is 400 t/s on that hardware. If the prompt is very short you can't see the real speed, though, because it falls back to single-token context processing.
Comparison with an RTX Pro 6000, with DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf:
prefill: 121.76 t/s, generation: 47.85 t/s
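For a rough sense of what those rates mean end-to-end, here's a back-of-the-envelope sketch (the 10k-token prompt and 500-token reply are assumptions for illustration, not figures from the gist; the M5's generation rate isn't quoted above, so only its time to first token is computed):

    # Latency estimates from the quoted rates. Prompt/reply sizes are
    # illustrative assumptions, not measurements.
    PROMPT, REPLY = 10_000, 500

    for name, prefill_tps, gen_tps in [
        ("M5 (gen rate not quoted)", 400.0, None),
        ("RTX Pro 6000", 121.76, 47.85),
    ]:
        ttft = PROMPT / prefill_tps  # seconds until the first output token
        line = f"{name}: first token after ~{ttft:.0f}s"
        if gen_tps is not None:
            line += f", full reply in ~{ttft + REPLY / gen_tps:.0f}s"
        print(line)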
The main target seems to be Apple's Metal, so that makes sense. Might be fun to see how fast one could make it go, though :) The model seems really good too, even though it's in IQ2.
I don't want to be a jerk, but 31 t/s prefill is basically unusable in an agentic situation. A mere 10k tokens of context and you're sitting there for 5+ minutes before the first token is generated.
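Quick check on that arithmetic:

    seconds = 10_000 / 31   # 10k-token prompt at 31 t/s prefill
    print(seconds / 60)     # ~5.4 minutes, matching the "5+ minutes" above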
What are token speeds like for frontier models, if that gives a rough idea of how much "slower" slow is?
So you’re saying I should buy the M5? :) I’ve been resisting, thinking I’ll never use it… it’ll be better in a year… I’ll wait for the Studio (do we still think that’s coming in June?)… etc.
I expect this to be my main machine for the next 3-4 years (which is how I justified the 128GB one). It's a beast of a machine - I love that I can run an 80GB model and still have 48GB left for everything else.
Can't say that it wouldn't be a better idea to spend that cash on tokens from the frontier hosted models though.
I'm an LLM nerd so running local models is worth it from a research perspective.
An M5 Max MBP with 128GB of RAM costs ~$5k. An Nvidia RTX 5090 with 32GB of RAM is $4-5k, and an RTX PRO 6000 with 96GB is $10k. Do you have any data on which is the best price/performance for local inference? Do you know what the big OpenAI/Anthropic/Google datacenters are running?
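One crude way to compare from those prices alone is dollars per GB of model-addressable memory (a rough sketch that ignores memory bandwidth and compute throughput, which matter a lot for inference):

    # $/GB of memory from the prices quoted above; the $4.5k 5090 price
    # is the midpoint of the $4-5k range.
    options = {
        "M5 Max MBP 128GB":  (5_000, 128),
        "RTX 5090 32GB":     (4_500, 32),
        "RTX PRO 6000 96GB": (10_000, 96),
    }
    for name, (price, gb) in options.items():
        print(f"{name}: ${price / gb:.0f}/GB")

By that (very partial) metric the Mac comes out around $39/GB versus ~$104/GB for the Pro 6000 and ~$141/GB for the 5090, but it says nothing about tokens per second.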