Comment by michelsedgh

2 months ago

Dude, he's running locally, and I think this setup is the best bang for the buck if you want to run locally. We're not comparing to data centers; you've got to keep it in perspective. Those are very impressive results for running local. Thanks for the numbers, you saved me a ChatGPT search :)

Title says: locally it's expensive

Other person says: I had to spend $4,000 and it's still slow

  • Not to mention that $4,000 is in fact expensive. If anything, the OP really makes the point of the article's title.

CPU-only is really terrible bang for your buck, and I wish people would stop pushing these impractical builds on people genuinely curious about local AI.

The KV cache won't soften the blow the first time they paste a code sample into a chat and end up waiting 10 minutes, with absolutely no interactivity, before they even get the first token.

You'll get an infinitely more useful build out of a single 3090 and sticking to models like Gemma 27B than you will out of trying to run Deepseek on a CPU-only build. Even a GH200 struggles to run Deepseek at realistic speeds with bs=1, and that's with an entire H100-class GPU attached to the CPU: there just isn't a magic way to get "affordable, fast, effective" AI out of a CPU-offloaded model right now.

  • The quality on Gemma 27B is nowhere near good enough for my needs. None of the smaller models are.

    • And that's fine, but the average person asking is already willing to give up some raw intelligence by going local, and would not expect the kind of abysmal performance you're likely getting after hearing it described as "fast".

      I set up Deepseek at bs=1 on a $41,000 GH200 and got double-digit prompt processing speeds (~50 tk/s): you're definitely getting worse performance than the GH200 was, and that's already unacceptable for most users.

      They'd be much better served spending less money than you did and getting an actually interactive experience, instead of having to send off a prompt and wait several minutes for a reply the moment the query involves any real context.
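      For a rough sense of scale, here's the back-of-envelope math behind those wait times, using the ~50 tk/s prefill figure from the GH200 test above. The 30k-token prompt size is a hypothetical large code paste, chosen only for illustration:

      ```python
      # Time-to-first-token is roughly prompt length divided by the
      # prefill (prompt processing) rate, ignoring generation itself.
      prompt_tokens = 30_000      # hypothetical: a large pasted code sample
      prefill_tok_per_s = 50      # ~50 tk/s, the GH200 bs=1 figure above

      wait_seconds = prompt_tokens / prefill_tok_per_s
      print(f"{wait_seconds:.0f} s ≈ {wait_seconds / 60:.0f} min before first token")
      # 600 s ≈ 10 min before first token
      ```

      A CPU-only build with slower prefill only stretches that wait further, which is the whole interactivity complaint in a nutshell.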