Comment by CubsFan1060
12 hours ago
Knowing very little about how to run these, how close are we to medium or larger businesses starting to buy hardware to run models like this to keep the models local?
It’s expensive, and not as capable as the frontier models, but would have some pretty big benefits around privacy and agency.
I know of multiple businesses in Europe that have been doing that for a while with 70B models, and are upgrading hardware to run the new crop of 700B-1T models (this really started around Kimi K2, but buying and hosting that kind of hardware takes time).
Not everyone is willing (or even legally able) to send their trade secrets to OpenAI or Anthropic.
While certainly there are such cases with trade secrets, it's worth noting that even large banks typically have a provider like Azure or AWS onboarded.
There they can deploy these models while using the existing legal frameworks.
What kind of hardware/price does it take to run those?
Nvidia will sell you an entire server rack ready for inference. Or maybe you can roll your own Blackwell-based system.
We’re approaching a world where running a primer frontier model is possible on a workstation; Nvidia’s next generation will probably include something under $30k that looks like a desktop. It sounds expensive, until you look at your Anthropic bill.
The unit economics for the open models are similar to cloud computing. You can save a ton on expenses by buying the hardware, but it requires a lot of in-house expertise, and you get the most value if you keep the system operating around the clock. The big kink is that open models are usually 2 quarters behind frontier, and your competitors are probably trying to get access to Mythos.
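To make the buy-vs-rent tradeoff concrete, here is a rough break-even sketch. Every number in it is a made-up placeholder, and `breakeven_months` is just an illustrative helper, not anyone's actual pricing model; it also ignores depreciation, staffing ramp-up, and the capability gap mentioned above.

```python
def breakeven_months(hardware_cost, monthly_opex, monthly_api_bill):
    """Months until owned hardware has paid for itself versus an API bill.

    hardware_cost    -- one-time purchase price (hypothetical)
    monthly_opex     -- power, hosting, and staff cost per month (hypothetical)
    monthly_api_bill -- what the same workload costs via a hosted API (hypothetical)

    Returns None when running the hardware costs at least as much per month
    as the API bill, because then buying never pays off.
    """
    monthly_savings = monthly_api_bill - monthly_opex
    if monthly_savings <= 0:
        return None
    return hardware_cost / monthly_savings


# Placeholder inputs only; plug in your own quotes and invoices.
months = breakeven_months(
    hardware_cost=150_000,
    monthly_opex=5_000,
    monthly_api_bill=20_000,
)
print(f"Break-even after {months:.1f} months")
```

The result is only as good as the opex estimate, which is where the in-house expertise cost tends to hide.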
For an 8-bit quant (what people call "near lossless") you are looking at something like 4x MI350X, which comes out to about $150k after adding the rest of the server. More if you go with Nvidia instead of AMD, and more again if you want more than maybe 8x concurrency.
But prices are changing rapidly, and not for the better.
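A back-of-the-envelope check on why a 4-GPU box is the ballpark. This is a sketch under assumptions: a ~750B-parameter model (the size quoted elsewhere in this thread), 288 GB of HBM per MI350X (the figure AMD publishes; verify against the current spec sheet), and a guessed 20% headroom for KV cache and activations. Real headroom depends heavily on context length and concurrency.

```python
def weights_gb(params_billion, bits_per_weight):
    """Approximate weight memory in GB (1 GB = 1e9 bytes).

    Counts only the weights; KV cache and activations are not included.
    """
    return params_billion * bits_per_weight / 8


def fits(params_billion, bits_per_weight, num_gpus, gb_per_gpu,
         overhead_fraction=0.2):
    """True if weights plus a guessed overhead fit in total GPU memory.

    overhead_fraction is an assumption standing in for KV cache and
    activations, not a measured value.
    """
    needed = weights_gb(params_billion, bits_per_weight) * (1 + overhead_fraction)
    return needed <= num_gpus * gb_per_gpu


# Assumed inputs: ~750B parameters, 4 GPUs with 288 GB each.
for bits in (16, 8, 4):
    size = weights_gb(750, bits)
    ok = fits(750, bits, num_gpus=4, gb_per_gpu=288)
    print(f"{bits}-bit: ~{size:.0f} GB of weights, fits on 4 GPUs: {ok}")
```

Under these assumptions the 8-bit weights fit with room to spare, while 16-bit would not, and the spare room is what limits how many concurrent requests you can serve.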
This is not a new situation. The same thing happened when good vision models like AlexNet were coming through, especially for OCR. Companies had a choice between cloud and self-hosting with GPUs. But it turns out the problem is usage patterns.
Your usage will peak during working hours in certain time zones (even if you are a huge multinational company, most of your engineers/users tend to be in only a few locations), so you end up with a bunch of GPUs doing nothing the rest of the day. Especially with latency-sensitive stuff, this is a decades-old tradeoff; it's not unique to LLMs.
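The utilization point can be put into rough numbers. This is a sketch with placeholder inputs: the server price, the 3-year lifetime, and the 10-hours-a-day, 5-days-a-week usage window are all assumptions, and power and staffing are left out entirely.

```python
def office_hours_utilization(hours_per_day=10, days_per_week=5):
    """Fraction of the week the hardware is busy if it is only used in work hours."""
    return (hours_per_day * days_per_week) / (24 * 7)


def cost_per_used_hour(hardware_cost, lifetime_years, utilization):
    """Amortized hardware cost per hour the server is actually in use.

    utilization is the fraction of wall-clock time the server does useful work.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    total_hours = lifetime_years * 365 * 24
    return hardware_cost / (total_hours * utilization)


# Placeholder figures, not real quotes.
util = office_hours_utilization()
always_on = cost_per_used_hour(150_000, lifetime_years=3, utilization=1.0)
office_only = cost_per_used_hour(150_000, lifetime_years=3, utilization=util)
print(f"utilization during office hours only: {util:.0%}")
print(f"cost per used hour at 100% utilization: ${always_on:.2f}")
print(f"cost per used hour at office-hours utilization: ${office_only:.2f}")
```

The cost per useful hour scales with the inverse of utilization, which is why batch or overnight workloads that soak up the idle hours matter so much for self-hosting.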
It’s a ~750B model, so still a hell of a lot of VRAM.
Would need to be a pretty determined medium biz.
So far there seems to be one major use case for complete privacy, and that is legal work. You don't need top-of-the-line models to search vast amounts of text in discovery, and it needs to be completely confidential. There are quite a few lawyers over on r/localllama showing off their multi-GPU builds. Coincidentally, they also have the vast funding required for it.
Unless you have genuine national security concerns, you’d be better off just negotiating a commercial agreement with privacy protections with a couple of existing vendors.
I think that's true until it isn't, which may end up being the problem. Fable/Mythos doesn't fall under the ZDR agreements with Anthropic. And I'm curious if others will follow suit.
If you can afford the investment, you get stable low costs for years with better security (at least if your cyber team is good). It's even better in regulated industries, where some vendors might add a premium for HIPAA/SOC/PCI DSS compliance, to the point that it's a lot cheaper to self-host. For a smaller business it's not worth it, and you should just use a hosted open model.
> to the point its a lot cheaper to self host
I'm pretty skeptical, especially given typical utilization patterns. Do you have numbers, or is this just vibes?
> how close are we to medium or larger businesses starting to buy hardware to run models like this to keep the models local?
Years.
Even Microsoft said they don't have enough for GitHub and need to call Amazon.
Getting even a few at decent prices is hard. Unless the shortages ease...