
Comment by behnamoh

20 hours ago

until Microsoft buys it and enshittifies it.

This is a fun thought experiment. I believe that we are now at the $5 Uber (2014) phase of LLMs. Where will it go from here?

How much will a synthetic mid-level dev (Opus 4.5) cost in 2028, after the VC subsidies are gone? As much as the market will bear, I would imagine. Dynamic pricing?

Will the SOTA model labs even sell API keys to anyone other than partners/whales? Why would they bother? At that point, they themselves are the personalized app devs and hosts!

Man, this is the golden age of building. Not everyone can do it yet, and every project you can imagine is greatly subsidized. How long will that last?

  • While I remember $5 Ubers fondly, I think this situation is significantly more complex:

    - Models will get cheaper, maybe way cheaper

    - Model harnesses will get more complex, maybe way more complex

    - Local models may become competitive

    - Capital-backed access to more tokens may become absurdly advantaged, or not

    The only thing I think you can count on is that more money buys more tokens, so the more money you have, the more power you will have ... as always.

    But whether some version of the current subsidy, which levels the playing field, will persist seems really hard to model.

    All I can say is that the bad scenarios I can imagine are pretty bad indeed, much worse than the Uber outcome, where the worst case turned out to be that owning a car is now cheaper for me again, which it wasn't 10 years ago.

    • If the electric grid cannot keep up with the additional demand, inference may not get cheaper. The cost of electricity would go up for LLM providers, and VCs would have to subsidize them even more until the price of electricity comes back down. That could take longer than they can afford to wait, especially if they have been expecting LLMs to replace many more workers within the next few years.

  • The real question is how long it'll take for Z.ai to clone it at 80% quality and offer it at cost. The answer appears to be "like 3 months".

    • This is a super interesting dynamic! The CCP is really good at subsidizing and flooding global markets, but in the end, it takes power to generate tokens.

      In my Uber comparison, the product depended on physical hardware on location (taxis); that is not the case with token delivery.

      This is such a complex situation in that regard. However, once the market settles and monopolies are established, the price will eventually be whatever the market can bear. Will that actually increase gross planet product, or will the SOTA token providers just eat up the existing gross planet product, with no increase?

      I suppose whoever has the cheapest electricity will win this race to the bottom? But... will that ever increase global product?

      ___

      Upon reflection, the comment above was likely influenced by this truly amazing quote from Satya Nadella's interview on the Dwarkesh podcast. This might be one of the most enlightened things that I have ever heard in regard to modern times:

      > Us self-claiming some AGI milestone, that's just nonsensical benchmark hacking to me. The real benchmark is: the world growing at 10%.

      https://www.dwarkesh.com/p/satya-nadella#:~:text=Us%20self%2...


  • I can run Minimax-m2.1 on my M4 MacBook Pro at ~26 tokens/second. It's not Opus, but it can definitely do useful work when kept on a tight leash. If models improve at anything like the rate we have seen over the last 2 years, I would imagine something as good as Opus 4.5 will run on similarly specced new hardware by then.
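
    As a back-of-envelope check on that number (a rough sketch; the active-parameter count, Apple-silicon bandwidths, quantization level, and efficiency are all assumptions, not measured specs): decode is memory-bandwidth bound, so tokens/s is roughly usable bandwidth divided by the bytes of active weights streamed per token.

    ```python
    # Decode is roughly memory-bandwidth bound: each generated token streams
    # the active weights once, so tokens/s ~= usable bandwidth / bytes per token.

    def est_decode_tps(mem_bw_gbs: float, active_params_b: float,
                       bits_per_weight: float, efficiency: float = 0.6) -> float:
        bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
        return mem_bw_gbs * 1e9 * efficiency / bytes_per_token

    # Hypothetical MoE with ~10B active params at 4-bit quantization
    # (the total weights still have to fit in RAM, hence aggressive quants):
    print(est_decode_tps(273, 10, 4))  # ~33 tok/s on ~273 GB/s (M4 Pro-class)
    print(est_decode_tps(546, 10, 4))  # ~66 tok/s on ~546 GB/s (M4 Max-class)
    ```

    Under those assumptions, ~26 tokens/second is right in the plausible range for this class of hardware.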

    • I appreciate this. However, as a ChatGPT, Claude.ai, Claude Code, and Windsurf user... who has tried nearly every variation of Claude, GPT, and Gemini in those harnesses, and has tested all of those models via API for LLM integrations in my own apps... I just want SOTA, 99% of the time, for myself and my users.

      I have never seen a use case where a "lower" model was useful for me, and especially for my users.

      I am about to get almost exactly the MacBook that you have, but I still don't want to inflict non-SOTA models on my code or my users.

      This is not a judgement against you or the downloadable weights; I just don't know when it would be appropriate to use those models.

      BTW, I very much wish that I could run Opus 4.5 locally. The best that I can do for my users is the Azure agreement that they will not train on their data. I also have that setting enabled on my claude.ai sub, but I trust them far less.

      Disclaimer: no model I have tried is even close to Opus 4.5 for agentic tasks. In my own apps, I process a lot of text and complex context, and I use Azure GPT-4.1 for limited LLM tasks... but for my "chat with the data" UX, it's Opus 4.5 all day long. It has tested as far superior.
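
      For illustration, the split looks roughly like this (a minimal sketch, assuming the standard Azure OpenAI and Anthropic Python clients; the endpoints, keys, deployment names, and model IDs are placeholders, not my actual stack):

      ```python
      # Illustrative routing: a cheap model for bounded extraction tasks,
      # the SOTA model for the user-facing "chat with the data" path.
      import anthropic
      from openai import AzureOpenAI

      azure = AzureOpenAI(api_version="2024-06-01",
                          azure_endpoint="https://<resource>.openai.azure.com",
                          api_key="<azure-key>")
      claude = anthropic.Anthropic(api_key="<anthropic-key>")

      def run(kind: str, prompt: str) -> str:
          if kind == "extract":  # limited, well-specified LLM task
              r = azure.chat.completions.create(
                  model="gpt-4.1",  # Azure deployment name; placeholder
                  messages=[{"role": "user", "content": prompt}])
              return r.choices[0].message.content
          # everything user-facing goes to the SOTA model
          r = claude.messages.create(
              model="claude-opus-4-5",  # placeholder model ID
              max_tokens=1024,
              messages=[{"role": "user", "content": prompt}])
          return r.content[0].text
      ```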


    • I'm guessing that's ~26 decode tokens/s for 2-bit or 3-bit quantized Minimax-m2.1 at 0 context, and it only gets worse as the context grows.

      I'm also fairly sure your prefill is slow enough to make the model mostly unusable even at smallish context windows, and entirely unusable at mid-to-large contexts.
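
      To put rough numbers on that (a back-of-envelope sketch; the active-parameter count, laptop compute, and utilization figures are all assumptions): prefill is compute-bound at roughly 2 FLOPs per active parameter per prompt token, all paid up front before the first output token.

      ```python
      # Prefill is compute-bound: ~2 FLOPs per active parameter per prompt
      # token, all incurred before the model emits anything.

      def est_prefill_seconds(prompt_tokens: int,
                              active_params_b: float = 10.0,  # hypothetical MoE active params
                              usable_tflops: float = 30.0,    # assumed laptop half-precision compute
                              efficiency: float = 0.4) -> float:
          flops = 2 * active_params_b * 1e9 * prompt_tokens
          return flops / (usable_tflops * 1e12 * efficiency)

      for n in (4_096, 32_768, 128_000):
          print(f"{n:>7} prompt tokens -> ~{est_prefill_seconds(n):.0f}s to first token")
      # ~7s at 4k, ~55s at 32k, ~3.6min at 128k: fine for short chat, painful
      # for agentic loops that re-read large contexts on every turn.
      ```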