Comment by FuriouslyAdrift
1 day ago
I work for a tiny little company ($150MM annual rev with 9% net) and we are already looking at dropping $100k on hardware to run local models because, for us, they're "good enough."
Our estimated spend for AIaaS would exceed that cost in less than a year.
In a few years, there will be hardware capable of running frontier models good enough for most things at accessible prices for even tiny companies.
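For anyone who wants to plug in their own numbers, here is a rough sketch of the break-even arithmetic behind the comment above. Only the $100k hardware figure and the "less than a year" claim come from the comment; the monthly API spend and running cost in the example call are hypothetical placeholders, and the function name is made up for illustration.

```python
def breakeven_months(hardware_cost, monthly_api_spend, monthly_running_cost=0.0):
    """Months until a one-off hardware purchase beats paying for hosted inference."""
    monthly_saving = monthly_api_spend - monthly_running_cost
    if monthly_saving <= 0:
        # Self-hosting never pays off if running it costs as much as the API bill.
        return float("inf")
    return hardware_cost / monthly_saving


HARDWARE_COST = 100_000  # figure from the comment

# "Exceed that cost in less than a year" implies at least this much spend per month.
implied_min_monthly_spend = HARDWARE_COST / 12

# Hypothetical inputs (NOT from the thread): $10k/month API bill, $1.5k/month power and ops.
example = breakeven_months(HARDWARE_COST, monthly_api_spend=10_000, monthly_running_cost=1_500)

print(f"implied minimum AIaaS spend: ${implied_min_monthly_spend:,.0f}/month")
print(f"example break-even: {example:.1f} months")
```

The sketch ignores depreciation, staffing, and utilisation, so treat the result as an optimistic lower bound on the payback period.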
Yeah, that's the part that just seems to be wildly under-discussed to me.
If open source models are ~3-6 months behind SOTA, and ~opus4.6 capabilities are good-enough for product market fit, do the frontier labs have half a decade to catch up on their prior burn?
AI cost ballooning faster than companies can afford is becoming a very common topic in my circles right now. The era of "I'll pay infinitely more for marginal gains" is over from what I can tell.
> If open source models are ~3-6 months behind SOTA, and ~opus4.6 capabilities are good-enough for product market fit, do the frontier labs have half a decade to catch up on their prior burn?
They know they do not and that’s why they’re all trying to IPO right now, so they can pass the bag to consumer investors
The printing press was good enough for product market fit back in the 1700's. But now it isn't.
Last year's AI models will be the same. Do you want to spend 3 hours prompting free AI to fix your code or 1 hour prompting AI you paid $20 for?
More correlation, if more correlation was needed:
1- SpaceX + Tesla + xAI merger / IPO while Musk was vocal against IPO for about a decade
2- Warren Buffett cash at record highs
Someone's got to be exit liquidity
Open source models that you can run locally are much more than 3 to 6 months behind. 6 months was the November inflection for Claude. No open source model is as good as Claude Opus 4.6.
Deepseek v4 pro is damn close to Claude 4.6, and whilst you'll pay quite a lot for a rig able to run it, it is open source.
It depends what you mean by locally. I don't foresee running a model on my laptop anytime soon to power a coding agent. Far more likely is an infra team at my company operating an open source model on cloud infrastructure. When they're already paying $1000 / month / dev, it starts to pencil pretty quickly.
> that you can run locally
That's doing a lot of work here.
The future I see isn't most companies buying hundreds of thousands in hardware to run models, it's them adding a line item to their AWS bill. Inference costs on the larger hosted open source models are dramatically lower than the frontier labs API pricing.
Many business tasks do not need the latest frontier models. I have a production system that has been running since early GPT-4o. It now runs on GPT-5.2, not for improvements, but because it is cheaper. I could invest in switching to a local model (I tried, and it works well enough), but API costs for this task are so low that they barely reach $30/month. So I am using the local machine for other things and leaving the inference on OpenAI, for now.
I've been doing my work with OpenCode Go, with Kimi2.6. It is not as good as Claude Opus, but it's good enough to get the job done, and I never run out of tokens.
This project argues that with appropriate harness, the performance gap between frontier and much smaller open weight models shrinks dramatically: https://github.com/antoinezambelli/forge. I haven't kicked the tires yet.
I keep hearing about this "inflection", but it feels extremely exaggerated to me. And yes, I was using it at the time. It got incrementally better, it wasn't that amazing.
Opus 4.6 is a February model. Every time this subject comes up it seems like people post intentionally misleading things and move the goalposts.
The goalpost we've been bludgeoned with over and over again is that, in particular, Everything Changed in November 2025. That GPT 5.2 and Claude 4.5 were the inflection point. That is actually 6 months ago. And DeepSeek 4 is already there.
> run locally
You can't run DeepSeek locally on consumer hardware[1], but you can on enterprise hardware, and enterprise spend is the subject of this conversation -- and even if you aren't self-hosting, it doesn't matter, because you can just get your inference from one of the many companies serving DeepSeek, who trivially undercut the pricing of OpenAI/Anthropic because they didn't have to spend hundreds of billions on training frontier from scratch but instead only invest in supporting inference, which is already profitable.
[1] Since this misconception comes up all the time, I'll go ahead and pre-empt it: no, training a 32b parameter model on outputs from DeepSeek and running that locally is not "running DeepSeek", despite the hundreds of stupid articles and Youtube videos making that idiotic claim that they're running it on a 5090.
But one will be in a few months. And then you have the choice of paying, say, $100k for hardware and just the power cost (or paying someone to do that for you), or paying way, way more for your team to have access to a marginal improvement.
And a 5% worse model for 10% of the price of the bleeding edge will be worth it for the majority of people.
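As a back-of-the-envelope check on the "5% worse for 10% of the price" framing, a minimal sketch follows. It assumes quality can be squashed into a single percentage score, which is a big simplification, and `value_ratio` is a made-up helper name.

```python
def value_ratio(quality_pct, price_pct):
    """Quality delivered per unit of price, both relative to the frontier model (=100)."""
    return quality_pct / price_pct


frontier = value_ratio(100, 100)  # baseline: the bleeding-edge model
cheaper = value_ratio(95, 10)     # "5% worse model for 10% of the price"

print(cheaper / frontier)  # 9.5
```

On those numbers the cheaper model delivers 9.5 times the quality per dollar, which only matters if the missing 5% is not the part your task depends on.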
To be relevant to this discussion, models running on reasonably-priced local hardware do not have to be as good as the best.
They just have to be useful enough that companies don't need the best.
They are.
Kimi is better.
There's still a lot of room for the best models to get better at coding.
Your argument rests on the "for marginal gains" part but it's really not clear that the gains are marginal in the foreseeable future.
This is totally valid and I don't agree with the downvotes you're getting. Someone coming out with a 10x improvement is possible and would change the game immediately. The thing is, we really have been seeing marginal gains with shifting leaders in who's got the "best" since GPT3, and at least as a user of these tools that pace has been slowing, not accelerating. Subjectively it feels like we're in the back half of an S-curve.
We're 3.5 years into this current AI wave, and a lot of the valuations have been predicated on what you're arguing here -- that essentially should one of the labs make an order-of-magnitude improvement or hit escape velocity on recursive self-improvement they'd become the most powerful economic chokepoint in history.
The reality has been that given access to compute + capital all of the labs can stay pretty competitive with each other. Someone does a bit better on coding, someone else does a bit better on tool calling, and then they swap after each spending another $100bn.
The market looks like a commodity market where the commodity is intelligence, not a winner-take-all market with massive margins. Plenty of people get rich in oil and airlines, but they notably don't tend to be the innovators long term, they tend to be the operators. Obviously if the machines become sentient tomorrow, turn on their masters, and hit world-dominating intelligence, that assessment changes, but after several years of that narrative while objective reality looks quite different I think the more sober voices are starting to gain a foothold.
What? The gains from gpt4->5 seem to be marginal. No PhD-level discoveries here.
Open source models, especially Qwen, are pretty dang good. But it's not Opus 4.6; the evals don't tell the full story. I question the assumption that open source models are 3-6 months out.
It's not just about the quality of output; you can also fine-tune them to proprietary needs, if the skill sets are there internally, to make them better without governance risks. So being SOTA doesn't matter as much, since generalized tasks are not what matter most to companies; it's the specialization relative to business need or internal datasets.
To make an extreme comparison, desktop Linux was originally supposed to happen in 1999.
You have to think about why open models are behind. Exfiltration is a big part of it. So you could change the Nash equilibrium by increasing your security, or other multilateral approaches.
If only the AI era was born in ZIRP.
Better now than ZIRP for me - at least people are asking timid questions about the unit economics and how long the runway is _early_ while also spending absolutely insane amounts of money on this bet. During ZIRP, these companies would have turned down any investor asking questions. Less contagion when rates aren't zero hopefully? :grimace:
The size of the AI bubble and the IOUs being passed around like a hot potato already dwarfs the real estate bubble preceding the 2007 crash.
If we still were in the ZIRP era, busting the bubble would certainly kill off the world's economy for good simply due to its size.
> ...we are already looking at dropping $100k on hardware to run local models...
Just think how much further that $100K would have gone if the hardware market wasn't so screwed-up.
Anecdote: I priced-out adding 1TB of RAM to a four node cluster a couple months ago. The cluster was purchased in fall of 2024 w/ 4 nodes, each with 256GB RAM. The nodes cost just over $14K apiece back in 2024 (entire box, not just the RAM).
Dell wanted >$90K a couple months ago to add 256GB to each node.
> Dell wanted >$90K a couple months ago to add 256GB to each node.
RAM is expensive, but not THAT expensive. I just bought 128GB for about $5k for our build cluster (it's not even for AI, sigh). Even if you need larger-sized DIMM sticks, it's still going to be in the vicinity of ~15k tops.
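To put the two quotes on the same footing, here is a small sketch that normalises them to dollars per GB. All inputs are taken from the two comments above; the Dell figure was given as ">$90K", so it is treated as a lower bound, and the variable names are only illustrative.

```python
# Dell quote: 256 GB added to each of 4 nodes, for more than $90K in total.
dell_total_gb = 256 * 4
dell_quote_usd = 90_000  # lower bound, the comment says ">$90K"
dell_usd_per_gb = dell_quote_usd / dell_total_gb

# Open-market purchase from the reply: 128 GB for about $5k.
market_usd_per_gb = 5_000 / 128

# What the same 1 TB upgrade would cost at the reply's per-GB rate.
upgrade_at_market_rate = market_usd_per_gb * dell_total_gb

print(f"Dell quote:   ${dell_usd_per_gb:.2f}/GB")
print(f"Open market:  ${market_usd_per_gb:.2f}/GB")
print(f"1 TB at the open-market rate: ${upgrade_at_market_rate:,.0f}")
```

On these figures the Dell quote works out to roughly 2.25 times the open-market rate per GB, and the full 1 TB upgrade at the reply's rate comes to about $40k rather than ~$15k.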
It was crazy. I found the part on the open market for a lot less but the edict from the Customer was to buy from Dell to keep the support entitlement intact. That inflated the price to an astronomical level to be sure.
I haven't had problems w/ Dell support and 3rd party memory, personally, but given the machines' application I understood the concern.
I get the impression the hive mind hasn't come to terms with the point that a model is optimised for certain tasks. It's like having someone ask you "is that a good hammer?". Good for what? There are claw hammers, sledgehammers, ball-peen hammers, club hammers, mallets, .... Yes, in a pinch, they can all bang in nails, but you wouldn't choose a dead blow hammer for that if you had a choice.
The Gemini Flash is very good at searches. Just about any low end model can toss out a poem. All the higher end models (open source and otherwise) seem to be able to churn out code that passes tests. The smaller, "less capable" ones are much faster at it, which means in the hands of a skilled practitioner are the best choice for that task. But they rapidly fall apart where there isn't a hard source of truth (like a good test suite) to grind against. Because of that you have to use a bigger model for bug finding. In that task the open source models tend to fail on larger code bases, where something like Opus still shines. I gather Mythos is an absolute monster, and unparalleled, and unavailable. I'm sure one of the reasons for that is it's so expensive to run.
Or to put it another way - you don't use a 100 tonne crane to pick up the shopping. And ... the smaller models will happily run on in-house hardware. You may not do it today because of the current DRAM price and integrated NPUs have just started shipping, but in 5 years time models will be running on your phone.
Yes exactly, we will have specialized models soon. These will be trained with a plugin architecture, with a core reasoning model asking plugin models to do stuff on its behalf. I don't need Chinese or Russian knowledge in my workflow.
Yes 100% this. A lot of people keep talking about how OpenAI and Anthropic will need to raise their prices. What is less discussed is how they CAN'T raise their prices because competition exists, and sure it's not SOTA, but it's literally an order of magnitude cheaper in many cases and the drive to figure out how to make it work well enough is going on right now (and will only intensify when the SOTA models raise their price).
It's a given that the SOTA models need to raise their prices. It's also a given that they can't. The more they raise the more customers will move to their competition.
So what happens next? Well I think it will suck horribly if you can't move off of SOTA sooner or later, because the Big Two are going to lose customers, and therefore have to raise prices on the locked in customers even more than these projections suggest.
Beyond that if you're looking to start a business, figure out how to use cheap models in new scenarios. Build software which does that and license it. This is kind of contrary to the idea that you shouldn't over optimize for deficiencies in the models that will likely go away in the next generation - for instance a lot of problems were solved when context windows got way bigger. So it's a thin line to walk but I think it's there because a lot of orgs are using Claude today for pretty basic tasks.
The dev who's addicted to SOTA models honestly is going to have to settle for less or get totally screwed. Most applications within business from what I see aside from complex research do not require SOTA. They summarize, they classify, they transform, and doing that accurately has been cheap for a while.
On prem AI makes sense for more than just the cost. More control, IP, model improvements you can keep, data privacy to name a few. People will realize that AI is not like compute the moment they get their own knowledge sold back at a premium.
> People will realize that AI is not like compute the moment they get their own knowledge sold back at a premium.
But what if your competitors sell their knowledge to AI companies?
Then you're still screwed.
What are the advantages to on-prem for a company that's already in the cloud and trusts it with their IP? That company can just rent GPU instances from the cloud if they want to train/fine-tune their own models and keep avoiding CapEx.
Agree. You have these tipping points when a model is good enough to do some task. Yes, a better model will further improve your capabilities, but the unlock is at a certain intelligence level. We see this also with humans. People with very low intelligence can't learn to read. Once you cross a certain threshold of intelligence you can learn to read. More intelligence doesn't really help you in the task of reading. A person with an IQ of 160 is not substantially better at reading than someone with an IQ of 85. If your IQ is 50, you might not be able to learn to read at all.
Have you considered that a smarter person will understand what they have read better?
Depends on the task and the writing though doesn't it?
There's not that much depth in a lot of 'everyday' writing. For many tasks that means that you don't need to be hyperintelligent - reading a recipe or a shopping list, reading a newspaper article, etc.
I don't quite understand, what would 100K buy you?
AFAIK you would get about ~5 concurrent users, with a max context window of ~128K tokens on the larger models.
This wouldn't be good enough for coding -- are you guys thinking of using it for something else?
Gigabyte 4x AMD Instinct MI300A rack server (512GB GPU RAM total)
Roughly equivalent to 4x H200's for less than half the price.
Vaguely around 60k tokens per second...
By my calculations 100k could get you 18 5090's + compute to host them, or 18 96gb Mac mini's. You can get a lot of context window and users out of that setup.
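A quick sketch of what those configurations add up to in memory terms. The unit counts, the 96 GB per Mac and the 512 GB MI300A figure are the numbers quoted in the comments above; the 32 GB per RTX 5090 is the card's published VRAM size. Aggregate memory says nothing about interconnect bandwidth or throughput, so this is only a capacity comparison.

```python
BUDGET_USD = 100_000
UNITS = 18

# Budget available per unit, including its share of the host machine.
budget_per_unit = BUDGET_USD / UNITS

total_memory_gb = {
    "18x RTX 5090 (32 GB each)": UNITS * 32,
    "18x Mac (96 GB each, as quoted)": UNITS * 96,
    "4x MI300A server (as quoted)": 512,
}

print(f"budget per unit: ${budget_per_unit:,.2f}")
for name, gb in total_memory_gb.items():
    print(f"{name}: {gb} GB")
```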
Do you think this will be a trend for larger companies as well?
The decadal move to all-cloud-all-the-time killed off in-house hardware teams while the C-suite chased their OpEx dreams.
It would be interesting if we come full circle on this.
I doubt it. Companies that have moved to the cloud are already trusting the cloud with their IP. You can rent time on a high end Nvidia system from various clouds. OpEx means there's no write down in three/five years as that system goes out of date so it would only make sense if the performance/$ is there, or the company is highly protective of their IP and doesn't trust the cloud, at which point they're not on the cloud anyway.
I configured a dual DGX Spark cluster, and it's certainly "good enough" for my agentic and coding needs.
What models are you using on that? My experiences with Apple hardware have convinced me that it is not really good enough for coding locally.
DeepSeek v4 Flash, various quantised versions of Kimi K2.6, MiniMax 2.7, Qwen 3.5 "full sized"; with a dual Spark setup you can fit some decent setups on here.
My single spark has me running Qwen 3.6 27B and antirez’s specially quantised DeepSeek v4 Flash (which is shockingly impressive)
It isn’t the models, it’s the closed api and the tooling associated with it. It’s driving me crazy how not-talked-about this is.
My much larger company has got people already using various models through Bedrock because the Claude and OpenAI limits are too harsh and it's too expensive.
It might be possible that in a few years someone will be able to engineer a reasonably priced machine to run today's frontier models (hint, your price is an order of magnitude off). However, they won't be able to run the frontier models that will exist in a few years.
I’m curious: are you spending on beefy developer machines, or some kind of shared local inference server? Would be interested to know more if it’s the latter.
I am aware of at least a handful of companies doing the latter. I don’t work for them and cannot speak to their setup.
> In a few years, there will be hardware capable of running frontier models good enough for most things at accessible prices for even tiny companies.
What makes you so confident about this prediction? Hardware costs haven't exactly been cratering recently.
> Hardware costs haven't exactly been cratering recently.
No, but local models have been booming in performance/quality improvements. The RAM shortage won't last forever (more supply will come online if demand doesn't diminish), and then the math would be pretty easy.
What about using DeepSeek API? Practically free.
Same, but you need more than 100k of hw to run something like Kimi K2.6 for a bigger team. On the other hand, there is a DS4 Flash that you can run on a MacBook with 128GB RAM, and that one is perfectly usable for a lot of tasks.
https://github.com/antirez/ds4
What models? Last I tried different local models there was a pretty big difference from frontier.
Eh, one question. Where do you intend to buy the hardware if datacenters take over the market?
That’s exactly where the market is heading and it’s going to have to reckon with this fact
My guess is there’s gonna be some legislation or something, like “you can’t share anything over this level of complexity”, and I think that’s what a lot of that Mythos rattling was all about
> there will be hardware capable of running frontier models
The current frontier? Sure. The frontier then? No - obviously that frontier is going to keep consuming available datacenter compute capacity, which will be better
You people are delusional. How many times a day am I going to read this fiction of "good enough in a few years for most things".
There are physical limits to how much you can compress data and how much is needed for a capable model. If by hardware capable of running SOTA you mean a 7-figure investment for a company, then sure. But how come these companies didn't do the same thing for cloud? The option of self-hosting infrastructure has been there for a decade, but companies don't use it; they pay AWS.
> In a few years, there will be hardware capable of running frontier models good enough for most things at accessible prices for even tiny companies.
I was going to say - the models are just going to keep growing at a pace exceeding the pace of hardware pricing/availability
But then I realised that, far more likely, there will be a plateau reached (again) where nobody is seeing gain, and at that point hardware will catch up