Comment by simonw
15 hours ago
I got this running on a 128GB M5 the other day - pretty painless, model runs in about 80GB of RAM and it seemed to be very capable at writing code and tool execution.
How’s the token throughput / response time?
Healthy!
From https://gist.github.com/simonw/31127f9025845c4c9b10c3e0d8612...
Prefill is 400 t/s on that hardware. If the prompt is very short you can't see the real speed, though, because it falls back to single-token context processing.
Comparison with an RTX Pro 6000, with DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf:
prefill: 121.76 t/s, generation: 47.85 t/s
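For a rough sense of what those rates mean end-to-end, here's a back-of-the-envelope sketch (the 10k-token prompt and 500-token reply are assumptions for illustration, not figures from the gist; the M5's generation rate isn't quoted above, so only its time to first token is computed):

    # Latency estimates from the quoted rates. Prompt/reply sizes are
    # illustrative assumptions, not measurements.
    PROMPT, REPLY = 10_000, 500

    for name, prefill_tps, gen_tps in [
        ("M5 (gen rate not quoted)", 400.0, None),
        ("RTX Pro 6000", 121.76, 47.85),
    ]:
        ttft = PROMPT / prefill_tps  # seconds until the first output token
        line = f"{name}: first token after ~{ttft:.0f}s"
        if gen_tps is not None:
            line += f", full reply in ~{ttft + REPLY / gen_tps:.0f}s"
        print(line)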
The main target seems to be Apple's Metal, so that makes sense. Might be fun to see how fast one could make it go, though :) The model seems really good too, even though it's in IQ2.
I don't want to be a jerk, but 31 t/s prefill is basically unusable in an agentic situation. A mere 10k tokens of context and you're sitting there for 5+ minutes before the first token is generated.
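Quick check on that arithmetic:

    seconds = 10_000 / 31   # 10k-token prompt at 31 t/s prefill
    print(seconds / 60)     # ~5.4 minutes, matching the "5+ minutes" above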
What are token speeds like for frontier models, if that gives a rough idea of how much "slower" slow is?
So you’re saying I should buy the M5? :) I’ve been resisting, thinking I’ll never use it… it’ll be better in a year… I’ll wait for the Studio (do we still think that’s coming in June?)… etc.
I expect this to be my main machine for the next 3-4 years (which is how I justified the 128GB one). It's a beast of a machine - I love that I can run an 80GB model and still have 48GB left for everything else.
Can't say that it wouldn't be a better idea to spend that cash on tokens from the frontier hosted models though.
I'm an LLM nerd so running local models is worth it from a research perspective.
An M5 Max MBP with 128GB of RAM costs ~$5k. An Nvidia RTX 5090 with 32GB of RAM is $4-5k, and an RTX PRO 6000 with 96GB is $10k. Do you have any data on which is the best price/performance for local inference? Do you know what the big OpenAI/Anthropic/Google datacenters are running?
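One crude way to compare from those prices alone is dollars per GB of model-addressable memory (a rough sketch that ignores memory bandwidth and compute throughput, which matter a lot for inference):

    # $/GB of memory from the prices quoted above; the $4.5k 5090 price
    # is the midpoint of the $4-5k range.
    options = {
        "M5 Max MBP 128GB":  (5_000, 128),
        "RTX 5090 32GB":     (4_500, 32),
        "RTX PRO 6000 96GB": (10_000, 96),
    }
    for name, (price, gb) in options.items():
        print(f"{name}: ${price / gb:.0f}/GB")

By that (very partial) metric the Mac comes out around $39/GB versus ~$104/GB for the Pro 6000 and ~$141/GB for the 5090, but it says nothing about tokens per second.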