Comment by pama
7 hours ago
Frankly, everyone in the industry knows. When people make these statements without additional clarity they are always talking about API prices. You can look at the NVL72 specs and make estimates for electricity and ownership costs rather easily. Inference at data-center scale is dirt cheap, even with public code using Dynamo and SGLang. The mystery is why the early misconceptions about inefficient inference persisted even after NVIDIA was very open about everything they did to help reduce costs dramatically in the last two years.
I imagine it's the lack of transparency. The costs are obviously coming down as people figure out how to tune both hardware and software. But there are costs other than just electricity as well. For example, chips do burn out. I recall reading that 2 to 3 years is roughly what you can expect under inference loads, so replacing chips is a nontrivial operational cost.
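To make the back-of-the-envelope estimate concrete, here is a rough sketch of how electricity and chip replacement could be folded into a cost per million tokens. The function name, its parameters, and every number in the example are made up for illustration. They are not NVL72 specs or measured throughput, so plug in your own figures.

```python
HOURS_PER_YEAR = 24 * 365


def cost_per_million_tokens(
    hardware_cost_usd,
    lifetime_years,
    power_kw,
    electricity_usd_per_kwh,
    tokens_per_second,
    utilization=1.0,
):
    """Rough serving cost in USD per one million generated tokens.

    Simplifying assumptions: straight-line amortization of the hardware
    over its service life, constant power draw regardless of load, and no
    cooling, networking, staffing, or financing costs.
    """
    # Hardware replacement cost spread evenly over every hour of its life.
    amortized_per_hour = hardware_cost_usd / (lifetime_years * HOURS_PER_YEAR)
    # Electricity cost for one hour of operation.
    energy_per_hour = power_kw * electricity_usd_per_kwh
    # Tokens actually served in one hour at the given utilization.
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return (amortized_per_hour + energy_per_hour) / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Placeholder inputs only; none of these are real hardware figures.
    for years in (2, 3):
        cost = cost_per_million_tokens(
            hardware_cost_usd=3_000_000,
            lifetime_years=years,
            power_kw=120,
            electricity_usd_per_kwh=0.08,
            tokens_per_second=500_000,
            utilization=0.5,
        )
        print(f"{years}-year chip life: ${cost:.4f} per 1M tokens")
```

With inputs like these, the amortized hardware term tends to dominate the electricity term, which is why the assumed chip lifetime matters so much to the final number.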
Also, as the costs of running this stuff come down, the incentive to rent models goes down with them. Running local models has the benefit that you get to keep your data local, you can tune them to do what you like, and you're not subject to model or price changes down the road. This makes self-hosting appealing both to individuals and companies. Currently, the barrier is the significant resources needed to run the models, but companies are already increasingly doing that with open models. And local inference that regular people can run is becoming a possibility as well.
While I'm sure there's always going to be a market for renting out models as a service, it may shrink significantly as the costs continue to come down.