Comment by all2

19 days ago

I'm assuming GP means 'run inference locally on GPU or RAM'. You can run really big LLMs on local infra; they just do a fraction of a token per second, so it might take all night to get a paragraph or two of text. Mix in things like thinking and tool calls, and it will take a long, long time to get anything useful out of it.
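
To put rough numbers on that, here is a minimal sketch of the arithmetic. The token counts and decode rates are assumptions picked for illustration, not measurements of any particular model or machine.

```python
def generation_hours(num_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock hours needed to emit num_tokens at a steady decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second / 3600


# Assumed sizes: ~300 tokens for a paragraph or two of visible text,
# plus a hypothetical 2,000 tokens of "thinking" emitted before the answer.
ANSWER_TOKENS = 300
THINKING_TOKENS = 2_000

for rate in (0.05, 0.5, 5.0):  # assumed tokens per second
    plain = generation_hours(ANSWER_TOKENS, rate)
    with_thinking = generation_hours(ANSWER_TOKENS + THINKING_TOKENS, rate)
    print(f"{rate:>5} tok/s: {plain:6.2f} h plain, {with_thinking:6.2f} h with thinking")
```

The point is only that time scales linearly with token count, so any hidden reasoning or tool-call chatter multiplies the wait at the same rate.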

4 comments

hxtk 18 days ago

I’ve been experimenting with this today. I still don’t think AI is a very good use of my programming time… but it’s a pretty good use of my non-programming time.

I ran OpenCode with some 30B local models today and it got some useful stuff done while I was doing my budget, folding laundry, etc.

Compared apples to apples, it’s less likely to “one-shot” a task than the big cloud models are; Gemini 3 Pro can one-shot reasonably complex coding problems through the chat interface. But through the agent interface, where it can run tests, linters, etc., the local model does a pretty good job for the size of task I find reasonable to outsource to AI.

This is with a high end but not specifically AI-focused desktop that I mostly built with VMs, code compilation tasks, and gaming in mind some three years ago.

guerrilla 19 days ago

Yes, this is what I meant. People are running huge models at home now, so I assumed a business could do it on premises or in a data center, presumably faster... but yeah, it definitely depends on what time scales we're talking about.

copperx 18 days ago

I'd love to know what kind of hardware it would take to do inference at the speed provided by the frontier model providers (assuming their models were available for local use).
$10k worth of hardware? $50k? $100k?
Assuming a single user.

HumanOstrich 18 days ago

Huge models? First you have to spend $5k-$10k or more on hardware. Maybe $3k for something extremely slow (<1 tok/sec) that is disk-bound. At those prices, local hardware won't be a better deal than batch API pricing for a long, long time.
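
A rough break-even sketch makes the comparison concrete. The API price below is a hypothetical placeholder, not a quote from any provider, and electricity and depreciation are ignored; only the $3k and 1 tok/sec figures come from the numbers above (taking 1 tok/sec as the best case of "<1").

```python
def breakeven_days(hardware_usd: float,
                   api_usd_per_million_tokens: float,
                   tokens_per_second: float) -> float:
    """Days of nonstop generation before the hardware cost equals API spend."""
    if api_usd_per_million_tokens <= 0 or tokens_per_second <= 0:
        raise ValueError("price and throughput must be positive")
    tokens_needed = hardware_usd / api_usd_per_million_tokens * 1_000_000
    return tokens_needed / tokens_per_second / 86_400  # seconds per day


# HYPOTHETICAL batch price in USD per million output tokens; substitute a real one.
ASSUMED_API_PRICE = 1.00

days = breakeven_days(hardware_usd=3_000,
                      api_usd_per_million_tokens=ASSUMED_API_PRICE,
                      tokens_per_second=1.0)
print(f"break-even after about {days:,.0f} days of continuous generation")
```

Break-even time scales inversely with both throughput and API price, so a slow, disk-bound box is hurt twice: it costs money up front and produces few tokens to amortize it over.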
Also you still wouldn't be able to run "huge" models at a decent quantization and token speed. Kimi K2.5 (1T params) with a very aggressive quantization level might run on one Mac Studio with 512GB RAM at a few tokens per second.
To run Kimi K2.5 at an acceptable quantization and speed, you'd need to spend $15k+ on 2 Mac Studios with 512GB RAM and cluster them. Then you'll maybe get 10-15 tok/sec.
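
The one-machine versus two-machine split follows from simple weight-size arithmetic. This is a minimal sketch that counts the weights only; KV cache, activations, and OS overhead are not modeled, so real headroom is smaller than shown. The 1T parameter count and 512GB figure are taken from the comment above, and the bit widths are illustrative choices.

```python
import math


def weight_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 1e12              # 1T parameters, as stated in the thread
RAM_PER_MACHINE_GB = 512   # one Mac Studio, as stated in the thread

for bits in (2, 4, 8, 16):  # illustrative quantization levels
    size = weight_gb(PARAMS, bits)
    # Lower bound only: ignores KV cache and memory the OS keeps for itself.
    machines = math.ceil(size / RAM_PER_MACHINE_GB)
    print(f"{bits:>2}-bit: {size:7.0f} GB of weights, at least {machines} machine(s)")
```

At 4 bits per weight the weights alone come to roughly 500 GB, which leaves almost nothing spare on a single 512GB machine; that is why anything less aggressive pushes you to a second one.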