Comment by mmmllm
14 hours ago
Sure, but where is the demand going to come from? LLMs are already in every Google search, in WhatsApp/Messenger, throughout Google Workspace, Notion, Slack, etc. ChatGPT already has a billion users.
Plus penetration is already very high in the areas where they are objectively useful: programming, customer care etc. I just don't see where the 100-1000x demand comes from to offset this. Would be happy to hear other views.
As plenty of others have mentioned here, if inference were 100x cheaper, I would run 200x inference.
There are so many things you can do with long running, continuous inference.
But what if you don't need to run it in the cloud?
You will ALWAYS want to use the absolute best model, because your time is more valuable than the machine's. If the machine gets faster or more capable, your value has jumped proportionally.
If LLMs were next to free and faster, I would personally increase my consumption 100x or more, and I'm only in the "programming" category.
We are nearly infinitely far away from saturating compute demand for inference.
Case in point: I'd like something that assesses, in real time, all the sensors and API endpoints of stuff in my home and, as needed, bubbles up summaries, diaries, and emergency alerts. Right now that's probably a single H200, and well out of my "value range". The number of people in the world who do this now at scale is almost certainly less than 50k.
If that inference cost went to 1%, then a) I'd be willing to pay it, and b) there'd be enough of a market that a company could make money integrating a bunch of tech into a simple deployable stack, and therefore c) a lot more people would want it, likely enough to drive more than 50k H200s worth of inference demand.
Do you really need an H200 for this? Seems like something a consumer GPU could do. Smaller models might be ideal [0], as they don't require extensive world knowledge and are much more cost-efficient/faster; see the sketch below.
Why can't you build this today?
[0]: https://arxiv.org/pdf/2506.02153 Small Language Models are the Future of Agentic AI (Nvidia)
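To make the consumer-GPU point concrete, here is a minimal sketch of the kind of loop I mean. It assumes a local Ollama server and a stubbed sensor poller; the model name, endpoint, and sensor values are illustrative, not a recommendation.

    # Hypothetical sketch: a small local model triaging home sensor readings.
    # Assumes an Ollama server on localhost; model name and sensor values
    # are made up for illustration.
    import json
    import time
    import requests

    OLLAMA_URL = "http://localhost:11434/api/generate"
    MODEL = "llama3.2:3b"  # any small local model would do

    def read_sensors() -> dict:
        # Stand-in for real polling of thermostats, door sensors, etc.
        return {"thermostat_c": 21.5, "front_door": "closed", "smoke": "clear"}

    def triage(readings: dict) -> str:
        prompt = (
            "You are a home monitor. Given these sensor readings, reply with "
            "one line: either 'OK: <summary>' or 'ALERT: <reason>'.\n"
            f"Readings: {json.dumps(readings)}"
        )
        resp = requests.post(
            OLLAMA_URL,
            json={"model": MODEL, "prompt": prompt, "stream": False},
            timeout=60,
        )
        return resp.json()["response"].strip()

    while True:
        print(triage(read_sensors()))
        time.sleep(300)  # every five minutes is plenty for a "diary"

A ~3B model on a consumer GPU handles this kind of classify-and-summarize task fine; the H200 only enters the picture if you want continuous multimodal reasoning over raw camera feeds.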
Is all of that not achievable today with things like Google Home?
It doesn’t sound like you need to run an H200 to bridge the gap between what currently exists and the outcome you want.
Sure, but if that inference cost went to 1%, then Oracle's and Nvidia's business models would be bust. So you agree with me?
absolutely nobody wants or needs a fucking thermostat diary lmao, and the few ppl that do will have zero noticeable impact on the world's compute demands, i'm begging ppl on hn to touch grass or speak to an average person every now and then lol
it's pretty easy to dispute and dismiss a single use case for indiscriminate/excessive use of inference to achieve some goal, as you have done here, but it's hard to dispute every possible use case
You wouldn't even know that it existed, or how it worked. It would just work. Everybody wants hands-off control that they don't have to think or learn about.
edit: this reminds me of a state agency I once worked for that fired their only IT guy after they moved offices, because the servers were running just fine without him. It was a Kafkaesque trauma for him for a moment, but it turned into a massive raise a week later when they renegotiated to bring him back.
> Plus penetration is already very high in the areas where they are objectively useful: programming, customer care etc.
Is that true? The BLS estimate of customer service reps in the US is 2.8M (https://www.bls.gov/oes/2023/may/oes434051.htm), and while I'll grant that's from 2023, I would wager a lot that the number is still above 2M. Similarly, the overwhelming majority of software developers haven't lost their jobs to AI.
A sufficiently advanced LLM will be able to replace most, if not all of those people. Penetration into those areas is very low right now relative to where it could be.
Fair point - although there are already plenty of customer-facing LLM chatbots rolled out. Zendesk, Intercom, HubSpot, Salesforce Service Cloud all have AI features built into their workflows. I wouldn't say penetration is near its peak, but it's also not early stage at this point.
In any case, AI is not capable of fully replacing customer care. It will make it more efficient, but the non-deterministic nature of LLMs means they need to be supervised for complex cases.
Besides, I still think even the inference demand for customer care or programming will be small in the grand scheme of things. EVERY Google search (and probably every Gmail email) is already passed through an LLM - the demand for that alone is immense.
I'm not saying demand won't increase; I just don't see how demand increases so much that it offsets the efficiency gains to such an extent that Oracle etc. are planning tens or hundreds of times the compute need in the next couple of years. At the very least, I'm skeptical.
We've seen several orders of magnitude of improvement in CPUs over the years, yet try to do anything now and the interaction is often slower than it was on a ZX Spectrum. We can easily absorb another order-of-magnitude improvement, and that's only going to create more demand. We can/will have models thinking for us all the time, in parallel, bothering us only with findings/final solutions. There is no limit here, really.
I’m already throughput-capped on my output via Claude. If you gave me 10x the tokens/s, I’d ship at least twice as much value (at good-enough-for-the-business quality, to be clear).
There are plenty of use cases where the models are not smart enough to solve the problem yet, but there is very obviously a lot of value to be harvested from maturing and scaling out just the models we already have.
Concretely, the $200/mo and $2k/mo offerings will be adopted by more prosumer and professional users as the product experience matures.
The difference in usefulness between ChatGPT free and ChatGPT Pro is significant. Turning up compute for each embedded usage of LLM inference will be a valid path forward for years.
The problem is that unless you have efficiency improvements that radically alter the shape of the compute-vs-smartness curve, more efficient compute just translates into much smarter compute at worse efficiency.
If you can make an LLM solve a problem from 100 different angles at the same time, that's worth something.
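A toy sketch of that pattern (self-consistency style voting), assuming an OpenAI-compatible async client; the model name, n=100, and the exact-match vote are all illustrative:

    # Hypothetical sketch: fan out N sampled attempts in parallel and keep
    # the majority answer. Real systems would normalize answers or use a
    # verifier instead of exact-match voting.
    import asyncio
    from collections import Counter
    from openai import AsyncOpenAI

    client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

    async def one_attempt(question: str) -> str:
        resp = await client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative; any sampled model works
            messages=[{"role": "user", "content": question}],
            temperature=1.0,  # diversity across attempts is the whole point
        )
        return resp.choices[0].message.content.strip()

    async def solve_from_many_angles(question: str, n: int = 100) -> str:
        answers = await asyncio.gather(*(one_attempt(question) for _ in range(n)))
        return Counter(answers).most_common(1)[0][0]

    print(asyncio.run(solve_from_many_angles("What is 17 * 23?")))

Note that at n=100 every question costs 100x the tokens, which is exactly the kind of demand multiplier being debated upthread.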
Isn't that essentially how the MoE models already work? Besides, if that were infinitely scalable, wouldn't we have a subset of super-smart models already at very high cost?
Besides, this would only apply to very few use cases. For a lot of basic customer care work, programming, and quick research, I would say LLMs are already quite good without running them 100x.
MoE models are pretty poorly named, since all the "experts" are "the same". They're probably better described as "sparse activation" models. "MoE" implies some sort of heterogeneous experts that a "thalamus router" is trained to use, but that's not how they work.
> if that were infinitely scalable, wouldn't we have a subset of super-smart models already at very high cost
The compute/intelligence curve is not a straight line. It's probably more like a curve that saturates, at maybe 70% of human intelligence. More compute still means more intelligence, but you'll never reach 100% human intelligence; it saturates well below that.
MoE is something different - it's a technique to activate just a small subset of parameters during inference.
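For anyone following along, here's a toy numpy sketch of that sparse-activation idea (all shapes and weights are invented; in a real model the router runs per token inside each transformer block):

    # Toy sparse routing: a learned gate scores all experts, and only the
    # top-k actually run. The compute saving is that n - k expert
    # matrices are never touched for this token.
    import numpy as np

    rng = np.random.default_rng(0)
    d_model, n_experts, top_k = 8, 4, 2

    # Each "expert" is just a matrix here; in a real MoE layer, an MLP.
    experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
    gate_w = rng.standard_normal((d_model, n_experts))

    def moe_layer(x: np.ndarray) -> np.ndarray:
        logits = x @ gate_w                # router scores, one per expert
        idx = np.argsort(logits)[-top_k:]  # indices of the top-k experts
        weights = np.exp(logits[idx])
        weights /= weights.sum()           # softmax over the chosen k only
        return sum(w * (x @ experts[i]) for w, i in zip(weights, idx))

    token = rng.standard_normal(d_model)
    print(moe_layer(token).shape)  # (8,): same output dim, 2 of 4 experts ran

Note that the "experts" here are structurally identical and interchangeable, which is the poorly-named point above: nothing about them is a heterogeneous specialist.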
Whatever is good enough now can be much better for the same cost (time, computation, actual cost). People will always choose better over worse.
I mean, 640KB should be enough for anyone too, but here we are. Assuming LLMs fulfill the expected vision, they will be in everything and everywhere. Think about how much the internet has permeated everyday life. Even my freaking toothbrush has WiFi now! 1000x demand is likely several orders of magnitude too low in terms of the potential demand (again, assuming LLMs deliver on the promise).
Long-running agents?