Comment by jl6

7 days ago

If the explanation really is, as many comments here suggest, that prompts can be run in parallel in batches at low marginal additional cost, then that feels like bad news for the democratization and/or local running of LLMs. If it’s only cost-effective to run a model for ~thousands of people at the same time, it’s never going to be cost-effective to run on your own.

Sure, but that's how most of human society works already.

It's more cost-effective to farm eggs from a hundred thousand chickens than it is for individuals to keep chickens in their yards.

You CAN run a GPT-class model on your own machine right now, for several thousand dollars of machine... but you can get massively better results if you spend those thousands of dollars on API credits over the next five years or so.

Some people will choose to do that. I have backyard chickens, they're really fun! Most expensive eggs I've ever seen in my life.

  • 50 years ago, general-purpose computers were also time-shared. Then the pendulum swung to the desktop, then back to central.

    I for one look forward to another 10 years of progress - or less - putting current models on a laptop. I don't trust any big company with my data.

For fungible things, it's easy to cost out. But not everything can be reduced to a per-token cost, especially as people start building their lives around specific models.

Even beyond privacy, availability is out of your control. Look at r/ChatGPT's collective spasm yesterday when 4o was taken away from them: you have no guaranteed access to these services, and for LLMs in particular, "upgrades" can completely change behavior you depend on.

Google has been even worse here in the past; I've seen them deprecate model versions with one month's notice. And a lot of model providers now seem to do dynamic model switching, quantization, and reasoning-effort adjustments based on load.

Well, you can also batch your own queries. That's not much use for a chatbot, but for an agentic system or offline batch processing it becomes more reasonable.

Consider a system where running a dozen queries at once is only marginally more expensive than running one query. What would you build?
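
To make that concrete, here's a rough sketch of local batched inference, assuming you use vLLM and a model that fits your hardware (the model name below is only an example):

    # Rough sketch: batched local inference with vLLM.
    # Assumes vLLM is installed (pip install vllm) and the example
    # model fits in your GPU memory - swap in any model you like.
    from vllm import LLM, SamplingParams

    # A dozen independent queries, submitted as one batch.
    prompts = [
        f"Write a one-sentence summary of chapter {i}."
        for i in range(1, 13)
    ]

    params = SamplingParams(temperature=0.7, max_tokens=128)

    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

    # vLLM schedules all twelve prompts together, so their tokens
    # share forward passes and the marginal cost per extra query is small.
    outputs = llm.generate(prompts, params)

    for out in outputs:
        print(out.prompt, "->", out.outputs[0].text.strip())

The speedup is exactly the batching effect described upthread: the weights are loaded once and amortized across every prompt in the batch.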

That economy of scale is what determines whether it's worth training one of these models in the first place. If you're just running inference on someone else's weights, you can afford to predict quite inefficiently.