
Comment by jurgenburgen

6 hours ago

If models become more efficient we will move more of the work to local devices instead of using SaaS models. We’re still in the mainframe era of LLM.

We moved from the mainframe era to desktops and smaller servers because computers got fast enough to do what we needed them to do locally. Centralized computing resources are still vastly more powerful than what's under your desk or in a laptop, but it doesn't matter because people generally don't need that much power for their daily tasks.

The problem with AI is that it's not obvious what the upper limit of capability demand might be. And unless and until we get there, there will always be demand for the more capable models that run on centralized computing resources. Even if at some point I'm able to run a model on my local desktop that's equivalent to current Claude Opus, if what Anthropic is offering as a service is significantly better in a way that matters to my use case, I will still want to use the SaaS one.

  • The underlying advantage of local inference is that you're repurposing your existing hardware for free. You don't need your token spend to pay a share of the capex cost for datacenters that are large enough to draw gigawatts in power, you can just pay for your own energy use. Even though the raw energy cost per operation will probably be higher for local inference, the overall savings in hardware costs can still be quite real.

  • > Even if at some point I'm able to run a model on my local desktop that's equivalent to current Claude Opus, if what Anthropic is offering as a service is significantly better in a way that matters to my use case, I will still want to use the SaaS one.

    Only if it's competitively priced. You wouldn't want to use the SaaS if the breakeven in investment on local instances is a matter of months.

    Right now people are shelling out for Claude Code and similar because for $200/m they can consume $10k/m of tokens. If you were actually paying $10k/m, then it makes sense to splurge $20k-$30k on a local instance.
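To make the breakeven arithmetic concrete, here's a quick sketch using the figures above; the $25k midpoint hardware cost and the $150/month electricity bill are illustrative assumptions, not quotes:

```python
# Rough breakeven estimate for buying local inference hardware instead of
# paying full API prices. All figures are illustrative assumptions.
hardware_cost = 25_000          # midpoint of the $20k-$30k local build
api_cost_per_month = 10_000     # the hypothetical full-price token spend
local_power_per_month = 150     # assumed electricity cost for the rig

monthly_savings = api_cost_per_month - local_power_per_month
breakeven_months = hardware_cost / monthly_savings
print(f"breakeven after ~{breakeven_months:.1f} months")  # ~2.5 months
```

Under these assumptions the hardware pays for itself in about a quarter, which is why the comparison only works if you'd genuinely otherwise be paying full API rates.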

The hyperscalers do not want us running models at the edge and they will spend infinite amounts of circular fake money to ensure hardware remains prohibitively expensive forever.

  • > they will spend infinite amounts of circular fake money
    > forever

    If that's the plan (there is no plan) then it expires at some point, because it's a spiral and such spirals always bottom out.

  • > of circular fake money

    Oh, it gets worse than that. The money that kicked all of this off for OpenAI was borrowed from Japanese banks at cheap interest rates (by SoftBank, for the Stargate project), and the Japanese banks can lend it because of deposits from Japanese people and companies; the collateral is stock that's inflated by ordinary people investing their hard-earned money into the markets.

    So in a way they are using real, hard-earned money to fund all of this; they are using your money to attack you behind your back.

    I once wrote a really long comment about the shaky finances of Stargate, so I'll link it here: https://news.ycombinator.com/item?id=47297428

  • > and they will spend infinite amounts of circular fake money to ensure hardware remains prohibitively expensive forever.

    That's ridiculous, "infinite money" isn't a thing. They will spend as much as they can not because they want to keep local solutions out, but because it enables them to provide cheaper services and capture more of the market. We all eventually benefit from that.

    • > That's ridiculous, "infinite money" isn't a thing.

      My reading of GP is that he was being sarcastic - "infinite amounts of circular fake money" is probably a reference to these circular deals going on.

      If A hands B a $100 investment, and B then hands A the same $100 to purchase hardware, A holds $100 of equity in B on paper, plus $100 of revenue (from B), for $200 in total assets.

      Obviously it has to be shuffled more thoroughly, but that's the basic idea that I thought GP was referring to.
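The double counting can be sketched in a few lines; the $100 figures and the single round trip are the toy assumptions from the comment above, not the structure of any actual deal:

```python
# Toy model of one circular deal: A invests $100 in B, and B immediately
# spends that same $100 buying hardware from A. No outside cash enters.
a_cash = 100          # A's starting cash
a_equity_in_b = 0
a_revenue = 0

# Step 1: A invests $100 in B.
a_cash -= 100
a_equity_in_b += 100
b_cash = 100

# Step 2: B buys $100 of hardware from A.
b_cash -= 100
a_cash += 100
a_revenue += 100

# A's cash is back where it started, but on paper A now also holds $100 of
# equity in B and has booked $100 of revenue.
total_paper_assets = a_cash + a_equity_in_b
print(a_cash, a_equity_in_b, a_revenue, total_paper_assets)  # 100 100 100 200
```

One $100 bill made the round trip, but A's balance sheet now shows $200 of assets plus $100 of revenue, which is the "shuffling" being described.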

As I understand this advancement, this doesn't let you run bigger models, it lets you maintain more chat context. So Anthropic and OpenAI won't need as much hardware running inference to serve their users, but it doesn't do much to make bigger models work on smaller hardware.

Though I'm not an expert, maybe my understanding of the memory allocation is wrong.

  • Seems to me if the model and the kv cache are competing for the same pool of memory, then massively compressing the cache necessarily means more ram available for (if it fits) a larger model, no?

    • Yes, but the context is a comparatively smaller part of how much memory is used when running it locally for a single user, vs when running it on a server for public... serving.
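A rough size comparison backs this up. Assuming a Llama-3-8B-like shape (32 layers, 8 KV heads, head dimension 128, fp16 values; these numbers are illustrative, not measured from any particular deployment):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """KV cache size: one K and one V tensor per layer, per sequence."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Single local user with an 8k context: about 1 GiB of cache, next to
# roughly 16 GB of fp16 weights, so the weights dominate.
single_user = kv_cache_bytes(32, 8, 128, seq_len=8192, batch=1)

# A server batching 64 concurrent users: about 64 GiB of cache, which now
# dwarfs the weights, so compressing it matters far more at scale.
server = kv_cache_bytes(32, 8, 128, seq_len=8192, batch=64)

print(single_user / 2**30, server / 2**30)  # 1.0 64.0 (GiB)
```

The cache scales with batch size and context length while the weights are paid once, which is why cache compression helps providers serving many users much more than a single local user.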

> If models become more efficient

Then we can make them even bigger.

  • > Then we can make them even bigger.

    But what if small models become "good enough" for most intents and purposes?

    There are people here and on r/localllama who run small models, sometimes several at once, to iterate quickly, with a larger model plugged in to fix whatever remains.

    Larger/SOTA models would still see some demand, but I don't think it would be nearly as much as people expect. We all still feel like different models are good for different tasks, and a common recommendation is to benchmark models against your own use cases; sometimes a small model that's strong within your particular domain is worth having in your toolset.

    • Because the true goal is AGI, not just nice little tools that solve subsets of problems. The first company to achieve human-level intelligence will be able to self-improve at a rate that creates a gigantic moat.


    • > But what if it becomes "good enough", that for most intents and purposes, small models can be "good enough"

      It's simple: then we'll make our intents and purposes bigger.

I don't see how we'll ever get to widespread local LLM.

The power efficiency alone is a strong enough pressure to use centralized model providers.

My 3090 running 24b or 32b models is fun, but I know I'm paying way more per token in electricity, on top of getting lower-quality tokens.

It's fun to run them locally, but for anything actually useful it's cheaper to just pay API prices currently.
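The electricity side can be sketched with back-of-envelope numbers; every input here is an assumption (~350 W draw under load, ~30 tokens/s for a 24b-32b model, $0.15/kWh residential power), not a measurement:

```python
# Back-of-envelope electricity cost per generated token on a local GPU.
# Every input here is an assumption, not a measurement.
watts = 350               # assumed power draw of a 3090 under load
tokens_per_second = 30    # assumed throughput for a 24b-32b model
price_per_kwh = 0.15      # assumed residential electricity price

kwh_per_token = (watts / 1000) / (tokens_per_second * 3600)
cost_per_million = kwh_per_token * 1e6 * price_per_kwh
print(f"~${cost_per_million:.2f} of electricity per million tokens")  # ~$0.49
```

Whether that figure is cheap or expensive depends on which hosted model you compare against; the quality gap mentioned above is the other half of the equation.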

I don't think we are there yet. Models running in data centers will still be noticeably better as efficiency will allow them to build and run better models.

Not many people today would settle for models comparable to what was SOTA 2 years ago.

To run models locally with results as good as the models running in data centers, we need both better efficiency and for AI improvement to hit a wall.

Neither of those conditions looks likely in the near future.

I like the mainframe comparison, but isn't there a key difference? Mainframes died because hardware got cheap, and that was predictable. LLM efficiency improving enough to run locally needs algorithmic breakthroughs, which... aren't predictable.

My gut says we'll end up with a split. Stuff where latency matters (copilot, local agents) moves to the edge once models actually fit on a laptop, but training and big context windows stay in the cloud because that's where the data lives.

One thing I keep going back and forth on: is MoE "better math" or just "better engineering"? Feels like that distinction matters a lot for where this all goes.

  • MoE feels a lot more like engineering to me. You're routing around the problem rather than actually solving it. The real math gains are things like quantization schemes that change how information is actually represented. Whether that distinction matters long term will probably depend on whether we hit a capability wall or an efficiency ceiling first.