Comment by siliconc0w

12 hours ago

All these new datacenters are going to be a huge sunk cost. Why would you pay OpenAI when you can host your own hyper-efficient Chinese model for roughly 90% less cost at 90% of the performance? And that's compared to today's subsidized pricing, which they can't keep up forever.

Eventually Nvidia or a shrewd competitor will release 64/128 GB consumer cards; locally hosted GPT-3.5+ quality is right around the corner. We're just waiting for consumer hardware to catch up at this point.

>to today's subsidized pricing, which they can't keep up forever.

The APIs are not subsidized; they probably have quite a large margin, actually: https://lmsys.org/blog/2025-05-05-large-scale-ep/
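A back-of-envelope check of why the margin claim is plausible; a minimal sketch in Python, with illustrative numbers that are my assumptions, not figures from the lmsys post:

    # Rough serving cost per million output tokens. Both inputs below are
    # assumed for illustration, not taken from the lmsys writeup.
    gpu_cost_per_hour = 2.00    # assumed $/GPU-hour for rented datacenter GPUs
    tokens_per_second = 2000    # assumed sustained output tokens/sec per GPU
                                # with batched, expert-parallel serving

    tokens_per_hour = tokens_per_second * 3600
    cost_per_million = gpu_cost_per_hour / tokens_per_hour * 1_000_000
    print(f"~${cost_per_million:.2f} per million output tokens")  # ~$0.28

Even with these rough numbers, ~$0.28/M in serving cost against list prices of $1.40-$15/M leaves plenty of room for margin.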

>Why would you pay OpenAI when you can host your own hyper efficient Chinese model

The 48 GB of VRAM or unified memory required to run this model at 4 bits is not free either.
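For context, the 48 GB figure follows from simple quantization arithmetic; a quick sketch, assuming a ~90B-parameter model (the exact size isn't stated in the thread) plus some overhead for the KV cache and runtime buffers:

    # Rough memory estimate for 4-bit weights. The ~90B parameter count and
    # 10% overhead are assumptions for illustration only.
    params = 90e9            # assumed parameter count
    bits_per_weight = 4      # 4-bit quantization
    overhead = 1.10          # assumed ~10% for KV cache, activations, buffers

    mem_gb = params * bits_per_weight / 8 / 1e9 * overhead
    print(f"~{mem_gb:.0f} GB")  # ~50 GB, in line with the 48 GB quoted above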

  • I didn't say it's free, but it is about 90% cheaper. Sonnet is $15 per million output tokens; this model just dropped and is available on OpenRouter at $1.40. Even against Gemini Flash, which is probably the best price-to-performance API yet generally ranks below Qwen's models, the $2.50 price means this is still 44% cheaper (quick check below).
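A quick sanity check of those percentages, using only the prices quoted above:

    # Per-million-output-token prices quoted in the comment above.
    sonnet, qwen, flash = 15.00, 1.40, 2.50

    print(f"vs Sonnet: {1 - qwen / sonnet:.0%} cheaper")  # vs Sonnet: 91% cheaper
    print(f"vs Flash:  {1 - qwen / flash:.0%} cheaper")   # vs Flash:  44% cheaper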