Comment by FuckButtons
16 hours ago
I can run Minimax-m2.1 on my M4 MacBook Pro at ~26 tokens/second. It’s not Opus, but it can definitely do useful work when kept on a tight leash. If models improve at anything like the rate we have seen over the last 2 years, I would imagine something as good as Opus 4.5 will run on similarly specced new hardware by then.
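If anyone wants to try a similar setup, a minimal sketch with mlx-lm looks roughly like this (the model path is a placeholder for whichever quantized build you actually grab, and the speeds you see depend heavily on quant level and RAM):

```python
# Minimal local-inference sketch with mlx-lm on Apple silicon.
# The repo path is a placeholder; substitute the quantized build you use.
from mlx_lm import load, generate

# Loads the weights and tokenizer into unified memory.
model, tokenizer = load("mlx-community/SOME-QUANTIZED-BUILD")

messages = [{"role": "user", "content": "Write a Python function that parses an ISO 8601 date."}]

# Apply the model's chat template so the prompt matches its training format.
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

# verbose=True prints prompt/generation tokens-per-second, which is where
# numbers like ~26 tok/s decode come from.
text = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)
print(text)
```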
I appreciate this, however, as a ChatGPT, Claude.ai, Claude Code, and Windsurf user... who has tried nearly every single variation of Claude, GPT, and Gemini in those harnesses, and has tested all of those models via API for LLM integrations into my own apps... I just want SOTA, 99% of the time, for myself and my users.
I have never seen a use case where a "lower" model was useful for me, and especially for my users.
I am about to get almost the exact MacBook that you have, but I still don't want to inflict non-SOTA models on my code, or my users.
This is not a judgement against you or the downloadable weights; I just don't know when it would be appropriate to use those models.
BTW, I very much wish that I could run Opus 4.5 locally. The best that I can do for my users is the Azure agreement that they will not train on their data. I also have that setting set on my claude.ai sub, but I trust them far less.
Disclaimer: No model is even close to Opus 4.5 for agentic tasks. In my own apps, I process a lot of text/complex context and I use Azure GPT-4.1 for narrow LLM tasks... but for my "chat with the data" UX, it's Opus 4.5 all day long. It has consistently tested as superior.
Is Azure's pricing competitive with OpenAI's offerings through the API? Thanks!
The last I checked, it is exactly equivalent per token to direct OpenAI model inference.
The one thing I wish for is that Azure Opus 4.5 had JSON structured output. Last I checked, that was in "beta" and only available via the direct Anthropic API. However, after many thousands of Opus 4.5 Azure API calls with the correct system and user prompts, not even one API call has returned invalid JSON.
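For anyone curious, the pattern is just "instruct for JSON, then validate." A rough sketch with the standard Anthropic Python SDK (the Azure deployment uses its own endpoint and credentials, and the model id below is a placeholder):

```python
# Sketch of the "prompt for JSON, then validate" pattern described above.
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM = (
    "You are a data-extraction service. Respond with a single JSON object "
    'matching {"title": string, "summary": string, "tags": [string]} '
    "and nothing else: no prose, no markdown fences."
)

def extract(text: str) -> dict:
    response = client.messages.create(
        model="claude-opus-4-5",  # placeholder; use whatever id your deployment exposes
        max_tokens=1024,
        system=SYSTEM,
        messages=[{"role": "user", "content": text}],
    )
    raw = response.content[0].text
    # json.loads is the cheap safety net; so far it has never had to raise here.
    return json.loads(raw)
```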
I'm guessing that's ~26 decode tokens/s for 2-bit or 3-bit quantized Minimax-m2.1 at 0 context, and it only gets worse as the context grows.
I'm also fairly sure your prefill is slow enough to make the model mostly unusable even at smallish context windows, and entirely unusable at mid-to-large contexts.
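To put rough numbers on it (the prefill rate below is an assumption for illustration, not a measurement; only the 26 tok/s decode figure comes from the comment above):

```python
# Back-of-envelope for why prefill dominates at larger contexts.
PREFILL_TOK_PER_S = 150.0   # assumed prompt-processing speed on laptop-class hardware
DECODE_TOK_PER_S = 26.0     # the decode figure quoted above
REPLY_TOKENS = 500          # assumed length of the generated reply

for context_tokens in (2_000, 16_000, 64_000):
    prefill_s = context_tokens / PREFILL_TOK_PER_S   # wait before the first token
    decode_s = REPLY_TOKENS / DECODE_TOK_PER_S       # time to emit the reply
    print(f"{context_tokens:>6} ctx: ~{prefill_s:4.0f}s before the first token, "
          f"~{decode_s:.0f}s to decode the reply")
```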