Comment by renegade-otter
2 days ago
They are not worse - the results are not repeatable, which is a much worse problem.
Like with cab hailing, shopping, social media ads, food delivery, etc.: there will be a whole ecosystem, workflows, and companies built around this. Then the prices will start going up with nowhere to run. Their pricing models are simply not sustainable. I hope everyone realizes that the current LLMs are subsidized, like your Seamless and Uber were in the early days.
A key difference is that the cost to execute a cab ride largely stayed the same. Gas to get you from point A to point B is ~$5, and there's a floor on what you can pay the driver. If your ride costs $8 today, you know that's unsustainable; it'll eventually climb to $10 or $12.
But inference costs are dropping dramatically over time, and that trend shows no signs of slowing. So even if a task costs $8 today thanks to VC subsidies, I can be reasonably confident that the same task will cost $8 or less without subsidies in the not-too-distant future.
Of course, by then we'll have much more capable models. So if you want SOTA, you might see the jump to $10-12. But that's a different value proposition entirely: you're getting significantly more for your money, not just paying more for the same thing.
>But inference costs are dropping dramatically over time,
Please prove this statement; so far there is no indication that this is actually true - the opposite seems to be the case. Here are some actual numbers [0] (and whether you like Ed or not, his sources have so far always been extremely reliable).
There is a reason the AI companies don't ever talk about their inference costs. They boast about everything they can find, but inference... not.
[0]: https://www.wheresyoured.at/oai_docs/
I believe OP's point is that for a given model quality, inference cost decreases dramatically over time. The article you linked talks about effective total inference costs which seem to be increasing.
Those are not contradictory: a company's inference costs can increase due to deploying more models (Sora), deploying larger models, doing more reasoning, and an increase in demand.
However, if we look purely at how much it costs to run inference on a fixed amount of requests for a fixed model quality, I am quite convinced that the inference costs are decreasing dramatically. Here's a model from late 2025 (see Model performance section) [1] with benchmarks comparing a 72B parameter model (Qwen2.5) from early 2025 to the late 2025 8B Qwen3 model.
The 9x smaller model outperforms the larger one from earlier the same year on 27 of the 40 benchmarks they were evaluated on, which is just astounding.
[1] https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct
++
Anecdotally, I find you can tell if someone worked at a big AI provider or a small AI startup by proposing an AI project like this:
" First we'll train a custom trillion parameter LLM for HTML generation. Then we'll use it to render our homepage to our 10 million daily visitors. "
The startup people will be like "this is a bad idea because you don't have enough GPUs for training that LLM" and the AI lab folks will be like "How do you intend to scale inference if you're not Google?"
What if we run out of GPU? Out of RAM? Out of electricity?
AWS is already raising GPU prices, which has never happened before. What if there is war in Taiwan? What if we want to get serious about climate change and start saving energy for vital things?
My guess is that, while they can do some cool stuff, we cannot afford LLMs in the long run.
> What if we run out of GPU?
These are not finite resources being mined from an ancient alien temple.
We can make new ones, better ones, and the main ingredients are sand and plastic. We're not going to run out of either any time soon.
Electricity constraints are a big problem in the near-term, but may sort themselves out in the long-term.
12 replies →
Your point could have made sense but the amount of inference per request is also going up faster than the costs are going down.
The parent said: "Of course, by then we'll have much more capable models. So if you want SOTA, you might see the jump to $10-12. But that's a different value proposition entirely: you're getting significantly more for your money, not just paying more for the same thing."
SOTA improvements have been coming from additional inference due to reasoning tokens and not just increasing model size. Their comment makes plenty of sense.
Is it? Recent models tend to need fewer tokens to achieve the same outcome. The days of ultrathink are coming to an end; Opus is quite usable without it.
1 reply →
> But inference costs are dropping dramatically over time, and that trend shows no signs of slowing. So even if a task costs $8 today thanks to VC subsidies, I can be reasonably confident that the same task will cost $8 or less without subsidies in the not-too-distant future.
I'd like to see this statement plotted against current trends in hardware prices at iso-performance. RAM, for example, is not meaningfully better than it was 2 years ago, and yet is 3x the price.
I fail to see how costs can drop while valuations for all major hardware vendors continue to go up. I don't think the markets would price companies this way if they thought all major hardware vendors were going to see margins shrink to commodity levels like you've implied.
I've seen the following quote.
"The energy consumed per text prompt for Gemini Apps has been reduced by 33x over the past 12 months."
My thinking is that if Google can give away LLM usage (which is obviously subsidized), it can't be astronomically expensive - it's probably in the realm of what we are paying for ChatGPT. Google has its own TPUs and a company culture oriented toward optimizing energy usage and hardware costs.
I tend to agree with the grandparent on this: LLMs will get cheaper at today's level of intelligence, and more expensive for SOTA models.
13 replies →
It's not the hardware getting cheaper; it's that LLMs were developed before we really understood how they worked, and there is still room to improve the implementations, particularly to do more with less RAM. That covers everything from doing more with fewer weights to lower-precision formats like FP16, not to mention that if you can 2x the speed, you get twice as much done with the same RAM and all the other parts.
3 replies →
> I'd like to see this statement plotted against current trends in hardware prices at iso-performance.
Prices for who? The prices that are being paid by the big movers in the AI space, for hardware, aren't sticker price and never were.
The example you use in your comment, RAM, won't work: It's not 3x the price for OpenAI, since they already bought it all.
> I fail to see how costs can drop while valuations for all major hardware vendors continue to go up. I don't think the markets would price companies this way if they thought all major hardware vendors were going to see margins shrink to commodity levels like you've implied.
This isn't hard to see. A company's overall profits are influenced – but not determined – by the per-unit economics. For example, increasing volume (quantity sold) at the same per-unit profit leads to more profits.
> I fail to see how costs can drop while valuations for all major hardware vendors continue to go up.
Yeah. Valuations for hardware vendors have nothing to do with costs. Valuations are a meaningless thing to integrate into your thinking about something objective like whether the retail cost of inference will trend down (obviously yes).
> So even if a task costs $8 today thanks to VC subsidies, I can be reasonably confident that the same task will cost $8 or less without subsidies in the not-too-distant future.
The same task on the same LLM will cost $8 or less. But that's not what vendors will be selling, nor what users will be buying. They'll be buying the same task on a newer LLM. The results will be better, but the price will be higher than the same task on the original LLM.
> Their pricing models are simply not sustainable. I hope everyone realizes that the current LLMs are subsidized, like your Seamless and Uber were in the early days.
If you run these models at home it's easy to see how this is totally untrue.
You can build a pretty competent machine that will run Kimi or DeepSeek for $10-20k and generate an unlimited amount of tokens all day long (I did a budget version with an Epyc machine for about $4k). Amortize that over a couple of years, and it's less than most people spend on a car payment. The pricing is sustainable, and that's ignoring the fact that the big model providers operate at economies of scale; they can parallelize GPUs and pack in requests much more efficiently.
> run these models at home
Damn, what kind of home do you live in, a data center? Teasing aside, maybe a slightly better benchmark is which sufficiently acceptable model (not an objective measure, but one can lean on debatable benchmarks) you can run on infrastructure that is NOT subsidized. That might include cloud providers, e.g. OVH, or "neo" clouds, e.g. HF, but honestly that's tricky to evaluate, as they tend to all have the pure players (OpenAI, Anthropic, etc.) or their owners (Microsoft, NVIDIA, etc.) as investors.
This ignores the cost of model training, R&D, managing the data centers, and more. OpenAI et al. regularly admit that all their products lose money. Not to mention that covering their costs isn't enough; they have to pay back all those investors while actually generating a profit at some point in the future.
Uhm, you actually just proved their point if you run the numbers.
For simplicity's sake we'll assume DeepSeek 671B on two RTX 5090s drawing 2 kW at full utilization.
In 3 years you've paid $30k total: $20k for the system + $10k in electricity @ $0.20/kWh.
The model generates 500M-1B tokens total over those 3 years @ 5-10 tokens/sec. Understand that's total throughput, covering both reasoning and output tokens.
You're paying $30-$60/Mtok - more than both Opus 4.5 and GPT-5.2, for less performance and fewer features.
And like the other commenters point out, this doesn’t even factor in the extra DC costs when scaling it up for consumers, nor the costs to train the model.
Of course, you can play around with parameters of the cost model, but this serves to illustrate it’s not so clear cut whether the current AI service providers are profitable or not.
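For anyone who wants to play with those parameters, here is a minimal Python sketch of the back-of-the-envelope math above. The hardware price, power draw, electricity rate, and 5-10 tok/s throughput are all the assumptions from this comment, not measured figures:

    # Back-of-the-envelope cost model for self-hosted inference.
    # All parameters are the assumptions above; adjust to taste.
    HARDWARE_COST_USD = 20_000        # assumed system price
    POWER_KW = 2.0                    # assumed draw at full utilization
    ELECTRICITY_USD_PER_KWH = 0.20
    YEARS = 3

    seconds = YEARS * 365 * 24 * 3600
    energy_cost = POWER_KW * (seconds / 3600) * ELECTRICITY_USD_PER_KWH
    total_cost = HARDWARE_COST_USD + energy_cost   # ~$30.5k

    for tps in (5, 10):                            # assumed tokens/sec range
        mtok = tps * seconds / 1e6                 # total millions of tokens generated
        print(f"{tps} tok/s: {mtok:,.0f} Mtok over {YEARS} years, ${total_cost / mtok:,.2f}/Mtok")

Under those assumptions it works out to roughly $32-$65 per million tokens, which is where the $30-$60/Mtok ballpark comes from.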
5 to 10 tokens per second is a bungus-tier rate.
https://developer.nvidia.com/blog/nvidia-blackwell-delivers-...
NVIDIA's 8xB200 gets you 30k tps on DeepSeek 671B at maximum utilization; that's 1 trillion tokens per year. At a dollar per million tokens, that's $1 million.
The hardware costs around $500k.
Now, ideal throughput is unlikely, so let's say you get half that. It's still 500B tokens per year.
Gemini 3 Flash is like $3/million tokens and I assume it's a fair bit bigger, maybe 1 to 2T parameters. I can sort of see how you can get this to work with the margins the AI companies repeatedly assert.
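The same kind of sketch for the provider side, using only the numbers assumed in this comment (the $500k node price, 30k tps peak, 50% utilization, and $1/Mtok price are assumptions, not vendor figures):

    # Rough provider-side economics under the assumptions above.
    HARDWARE_COST_USD = 500_000   # assumed 8xB200 node price
    PEAK_TOKENS_PER_SEC = 30_000  # claimed peak DeepSeek 671B throughput
    UTILIZATION = 0.5             # assume half of the ideal rate
    PRICE_USD_PER_MTOK = 1.0      # assumed selling price per million tokens

    tokens_per_year = PEAK_TOKENS_PER_SEC * UTILIZATION * 365 * 24 * 3600
    revenue_per_year = tokens_per_year / 1e6 * PRICE_USD_PER_MTOK
    print(f"{tokens_per_year / 1e12:.2f}T tokens/year -> ${revenue_per_year:,.0f}/year "
          f"vs ${HARDWARE_COST_USD:,} of hardware")

Under those assumptions a node sells roughly 0.47T tokens a year, about $470k of revenue against $500k of hardware, so the payback story hinges entirely on utilization and the per-token price.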
2 replies →
> Amortize that over a couple years, and it's cheaper than most people spend on a car payment.
I'm not parsing that: do you mean that the monthly cost of running your own machine 24x7 is less than the monthly cost of a car payment?
Whether true or false, I don't get how that is relevant to proving either that the current LLMs are not subsidised, or proving that they are.
If true, it means there's a lower bound that is profitable, at least taking into account current apparent purchase costs and energy consumption.
I'm not sure. I asked one about a potential bug in iOS 26 yesterday and it told me that iOS 26 does not exist and that I must have meant iOS 16. iOS 26 was announced last June and has been live since September. Of course, I responded that the current iOS version is 26 and got the obligatory meme of "Of course, you are right! ramble ramble ramble...."
Was this a GPT model? OpenAI seems to have developed an almost-acknowledged inability to usefully pre-train a model after mid-2024. The recent GPT versions are conspicuously lacking in newer knowledge.
The most amusing example I’ve seen was asking the web version of GPT-5.1 to help with an installation issue with the Codex CLI (I’m not an npm user so I’m unfamiliar with the intricacies of npm install, and Codex isn’t really an npm package, so the whole use of npm is rather odd). GPT-5.1 cheerfully told me that OpenAI had discontinued Codex and hallucinated a different, nonexistent program that I must have meant.
(All that being said, Gemini is very, very prone to hallucinating features in Google products. Sometimes I wonder whether Google should make a list of Gemini-hallucinated Google features and use the list to drive future product development.)
Gemini is similar. It insists that information from before its knowledge cutoff is still accurate unless explicitly told to search for the latest information before responding. Occasionally it disagrees with me on the current date and makes sarcastic remarks about time travel.
One nice thing about Grok is that it attempts to make its knowledge cutoff an invisible implementation detail to the user. Outdated facts do sometimes slip through, but it at least proactively seeks out current information before assuming user error.
LLMs solve the naming problem, so now there is just 1 thing wrong with software development. I can't tell if it's a really horrible idea that ultimately leads to a trainwreck, or freedom!
Sure. You have to be mindful of the training cutoff date for the model. By default, models won't search the web; they rely on data baked into their weights. That said, the ergonomics of this are horrible and a huge time waste. If I run into this situation I just say "Search the web".
If the training cutoff is before iOS 26, then the correct answer is 'I don't know anything about it, but it is reasonable to think it will exist soon'. Saying 'of course you are right' is a lie.
1 reply →
That will only work as long as there is an active "the web" to search. Unless the models get smart enough to figure out the answer from scratch.
Let's imagine a scenario. For your entire life, you have been taught to respond to people in a very specific way. Someone will ask you a question via email and you must respond with two or three paragraphs of useful information. Sometimes when the person asks you a question, they give you books that you can use, sometimes they don't.
Now someone sends you an email and asks you to help them fix a bug in Windows 12. What would you tell them?
I would say "what the hell is windows 12". And definitely not "but of course, excellent question, here's your brass mounted windows 12 wheeler bug fixer"
I mean I would want to tell them that windows 11 is the most recent version of windows… but also I’d check real quick to make sure windows 12 hadn’t actually come out without me noticing.
1 reply →
The other way around, but a month or so ago Claude told me that a problem I was having was likely caused by my Fedora version "since fedora 42 is long deprecated".
> The other way around, but a month or so ago Claude told me that a problem I was having was likely caused by my Fedora version "since fedora 42 is long deprecated".
Well, obviously, since Fedora 42 came out in 1942, when men still wore hats. Attempting to use such an old, out of style Linux distro is just a recipe for problems.
1 reply →
You are better off talking to Google's AI mode about that sort of thing because it runs searches. Does great talking about how the Bills are doing because that's a good example where timely results are essential.
I haven't found any LLM I totally trust about Arknights; for instance, no LLM seems to understand how Scavenger recovers DP. Allegedly there is a good Chinese wiki for that game which I could crawl, store in a JetBrains project, and ask Junie questions about, but I can't resolve the URL.
Even with search mode, I’ve had some hilarious hallucinations.
This was during the Gemini 2.5 era, but I got some just bonkers results looking for Tears of the Kingdom recipes. Hallucinated ingredients, out-of-nowhere recipes, and transposing Breath of the Wild recipes and effects into Tears of the Kingdom.
1 reply →
Which one? Claude (and to some extent, Codex) are the only ones which actually work when it comes to code. Also, they need context (like docs, skills, etc) to be effective. For example: https://github.com/johnrogers/claude-swift-engineering
I've been explaining that to people for a while now, along with a strong caution about how people are pricing these tools. It's all going to go up once dependency is established.
The AWS price increase on 1/5 for GPUs on EC2 was a good example.
AWS in general is a good example. It used to be much more affordable and better than boutique hosting. Now AWS costs can easily spiral out of control. Somehow I can run a site for $20 on Digital Ocean, but with AWS it always ends up $120.
RDS is a particular racket that will cost you hundreds of dollars for a rock-bottom tier. Again, Digital Ocean has sub-$20-per-month options that will serve many a small business. And yet, AWS is the default go-to at this point because the lock-in is real.
> RDS is a particular racket that will cost you hundreds of dollars for a rock-bottom tier. Again, Digital Ocean has sub-$20-per-month options that will serve many a small business. And yet, AWS is the default go-to at this point because the lock-in is real.
This is a little disingenuous though. Yeah you can run a database server on DO cheaper than using RDS, but you’ll have to roll all that stuff that RDS does yourself: automatic backups/restores, tuning, monitoring, failover, etc. etc. I’m confident that the engineers who’ve set up those RDS servers and the associated plumbing/automation have done a far better job of all that stuff than I ever could unless I spent a lot of time and effort on it. That’s worth a premium.
Yep. The goal is to build huge amounts of hype and demand, get their hooks into everyone, and once they've killed off any competition and built up the walls then they crank up the price.
The prices now are completely unsustainable. They'd go broke if it weren't for investors dumping their pockets out. People forget that what we have now only exists because of absurd amounts of spending on R+D, mountains of dev salaries, huge data centers, etc. That cannot go on forever.
The pricing will go down once the hardware prices go down. Historically hardware prices always go down.
Once the hardware prices go low enough pricing will go down to the point where it doesn't even make sense to sell current LLMs as a service.
I would imagine it's possible that, if that future ever comes to pass, there will be new forms of ultra-high-tier compute running other types of AI more powerful than an LLM. But I'm pretty sure AI in its current state will one day be running locally on desktops and/or handhelds, with the former being more likely.
Are hardware prices going down when each new generation improves less and less?
Yeah it’s not just a demand side thing. Costs go down as well. Every leap in new hardware costs a lot in initial investment and that’s included in a lot of the pricing.
Hopefully we'll get some real focus on making LLMs work amazingly well with limited hardware... the knock-on effect of that would be amazing when the hardware eventually drops in price.
We're building a house on sand. Eventually the whole damn thing is going to come crashing down.
It would mean that inference is not profitable. Calculating inference costs shows it's profitable, or close to it.
Inference costs have in fact been crashing, going from astronomical to... lower.
That said, I am not sure this indicator alone tells the whole story - it may even hide it, sort of like EBITDA.
I think there will still be cheap inference, what will rise in costs will be frontier model subscriptions. This is the thing that is not profitable.
>I hope everyone realizes that the current LLMs are subsidized
This is why I'm using it now as much as possible to build as much as possible in the hopes of earning enough to afford the later costs :D
> I hope everyone realizes that the current LLMs are subsidized, like your Seamless and Uber were in the early days.
A.I. == Artificially Inexpensive
> I hope everyone realizes that the current LLMs are subsidized
Hell ya, get in and get out before the real pricing comes in.
"I'm telling ya kid, the value of nostalgia can only go up! This is your chance to get in on the ground-floor so you can tell people about how things used to be so much better..."
Wait for the ads
On the bright side, I do think at some point after the bubble pops, we’ll have high quality open source models that you can run locally. Most other tech company business plans follow the enshittification cycle [1], but the interchangeability of LLMs makes it hard to imagine they can be monopolized in the same way.
1: I mean this in the strict sense of Cory Doctorow’s theory (https://en.wikipedia.org/wiki/Enshittification?wprov=sfti1#H...)
Except most of those services don't have at-home equivalents that you can increasingly run on your own hardware.
I run models with Claude Code (Using the Anthropic API feature of llama.cpp) on my own hardware and it works every bit as well as Claude worked literally 12 months ago.
If you don't believe me and don't want to mess around with used server hardware you can walk into an Apple Store today, pick up a Mac Studio and do it yourself.
I’ve been doing the same with GPT-OSS-120B and have been impressed.
Only gotcha is that Claude Code expects a 200k context window while that model maxes out at 130k or so. I have to do a /compress when it gets close. I'll have to see if there is a way to set the max context window in CC.
Been pretty happy with the results so far as long as I keep the tasks small and self contained.
1 reply →
Whats your preferred local model?
They just need to figure out the KV cache, which has turned into a magic black box; after that it'll be fine.
The results are repeatable. Models are performing with predictable error rates on the tasks these models have been trained and tested on.
AI is built to be non-deterministic. Variation is built into each response. If it wasn't I would expect AI to have died out years ago.
The pricing and quality of Copilot and Codex (which I am experienced in) feel like they are getting worse, but I suspect it may be that my expectations are getting higher as the technology matures...