Comment by simonw

14 hours ago

I got this running on a 128GB M5 the other day - pretty painless, model runs in about 80GB of RAM and it seemed to be very capable at writing code and tool execution.

How’s the token throughput / response time?

  • Healthy!

      prefill: 30.91 t/s, generation: 29.58 t/s
    

    From https://gist.github.com/simonw/31127f9025845c4c9b10c3e0d8612...

    • Comparison with an RTX Pro 6000, with DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf:

      prefill: 121.76 t/s, generation: 47.85 t/s

      Main target seems to be Apple's Metal, so makes sense. Might be fun to see how fast one could make it go though :) The model seems really good too, even though it's in IQ2.

    • I don't want to be a jerk, but 31 t/s prefill is basically unusable in an agentic situation. With a mere 10k tokens of context you're sitting there for 5+ minutes before the first token is generated.

    • what are token speeds like for frontier models, if that gives a rough idea of how much "slower" slow is?
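To put the "5+ minutes" figure in context, here is a rough sketch of the wait before the first token for a 10k-token prompt, using the two prefill rates quoted in this thread. It assumes prefill speed stays constant as context grows and that nothing is served from a prompt cache, neither of which is guaranteed in practice. The function and dictionary names are made up for illustration.

```python
# Rough time-to-first-token estimate from a prefill rate.
# Assumptions (not from the thread): constant prefill speed regardless of
# context length, and no prompt caching.

# Prefill rates quoted in the thread, in tokens per second.
PREFILL_TPS = {
    "M5 128GB": 30.91,
    "RTX Pro 6000": 121.76,
}


def prefill_seconds(prompt_tokens: int, prefill_tps: float) -> float:
    """Seconds spent processing the prompt before generation starts."""
    if prefill_tps <= 0:
        raise ValueError("prefill_tps must be positive")
    return prompt_tokens / prefill_tps


def wait_minutes(prompt_tokens: int = 10_000) -> dict[str, float]:
    """Minutes of prefill per machine for a prompt of the given size."""
    return {
        name: prefill_seconds(prompt_tokens, tps) / 60
        for name, tps in PREFILL_TPS.items()
    }


if __name__ == "__main__":
    for name, minutes in wait_minutes().items():
        print(f"{name}: {minutes:.1f} min before the first token")
```

Under those assumptions the M5 number lands a bit over five minutes for 10k tokens, which matches the estimate above, and the RTX Pro 6000 number comes out at well under two.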

So you’re saying I should buy the M5? :) I’ve been resisting, thinking I’ll never use it… it’ll be better in a year… I’ll wait for the Studio (do we still think that’s coming in June?)… etc.

  • I expect this to be my main machine for the next 3-4 years (which is how I justified the 128GB one). It's a beast of a machine - I love that I can run an 80GB model and still have 48GB left for everything else.

    Can't say that it wouldn't be a better idea to spend that cash on tokens from the frontier hosted models though.

    I'm an LLM nerd so running local models is worth it from a research perspective.

    • An M5 Max MBP with 128GB of RAM costs ~$5k. An Nvidia RTX 5090 with 32GB of RAM is $4-5k, and an RTX PRO 6000 with 96GB of RAM is $10k. Do you have any data on which is the best price/performance for local inference? Do you know what the big OpenAI/Anthropic/Google datacenters are running?

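One crude way to start on the price question is dollars per GB of memory, using only the rough prices quoted above. This is a sketch, not a benchmark: it says nothing about throughput, it treats unified memory and GPU VRAM as if they were interchangeable, and the GPU prices presumably exclude the host machine. The names below are hypothetical.

```python
# Dollars per GB of memory, from the approximate prices in the comment above.
# Assumption: price ranges are taken at face value; capacity only, no
# throughput or bandwidth is considered.

# name -> (low price USD, high price USD, memory in GB)
HARDWARE = {
    "M5 Max MBP": (5_000, 5_000, 128),
    "RTX 5090": (4_000, 5_000, 32),
    "RTX PRO 6000": (10_000, 10_000, 96),
}


def dollars_per_gb(low: float, high: float, gb: int) -> tuple[float, float]:
    """Return the (low, high) cost per GB for one piece of hardware."""
    return low / gb, high / gb


def cost_table() -> dict[str, tuple[float, float]]:
    """Cost per GB for every entry in HARDWARE."""
    return {name: dollars_per_gb(*spec) for name, spec in HARDWARE.items()}


if __name__ == "__main__":
    for name, (low, high) in cost_table().items():
        if low == high:
            print(f"{name}: ~${low:.0f}/GB")
        else:
            print(f"{name}: ~${low:.0f}-{high:.0f}/GB")
```

On capacity alone the laptop is the cheapest of the three per GB, but that ignores the prefill and generation gap reported earlier in the thread, so it is only one input to a price/performance comparison.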