Comment by yjftsjthsd-h
4 days ago
I've been kicking this around in my head for a while. If I want to run LLMs locally, a decent GPU is really the only important thing, so the question becomes, roughly: what is the cheapest computer to tack on the side of the GPU? Of course, that assumes everything does in fact work; unlike OP I am barely in a position to understand e.g. BAR problems, let alone fix them, so what I actually did was build a cheap-ish x86 box with a half-decent GPU and call it a day :) But it's still stuck in my brain: there must be a more efficient way to do this, especially if all you need is just enough computer to shuffle data to and from the GPU and serve it over a network connection.
I run a crowd-sourced website to collect data on the best and cheapest hardware setups for local LLMs here: https://inferbench.com/
Source code: https://github.com/BinSquare/inferbench
Cool site, I noticed the 3090 is on there twice.
https://inferbench.com/gpu/NVIDIA%20GeForce%20RTX%203090
https://inferbench.com/gpu/NVIDIA%20RTX%203090
Oh nice catch, I'll fix that
---
Edit: Fixed
Nice! Though for older hardware it would be nice if the price reflected the current second-hand market (harder to get data for, I know). E.g. the Nvidia RTX 3070 ranks as the second-best GPU in tok/s/$ even at its $499 MSRP, but you can get one for half that now.
Great idea - I've added it, with the initial data gathered by manually browsing eBay.
So it's just a static value in this hardware list: https://github.com/BinSquare/inferbench/blob/main/src/lib/ha...
Let me know if you know of a better way, or contribute :D
It seems like verification might need to be improved a bit? I looked at Mistral-Large-123B: someone is claiming 12 tokens/sec on a single RTX 3090 at FP16, but a 123B model at FP16 is roughly 246 GB of weights, an order of magnitude more than the card's 24 GB.
Perhaps some filter could cut out submissions that don't really make sense?
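A sketch of the kind of plausibility check being suggested, in TypeScript; the field names are made up for illustration and are not inferbench's actual schema or code:

    // Illustrative plausibility check for benchmark submissions.
    interface Submission {
      modelParamsB: number;                      // parameters in billions, e.g. 123
      weightDtype: "fp16" | "q8" | "q4";
      totalVramGb: number;                       // combined VRAM of the reported GPUs
    }

    const BYTES_PER_WEIGHT: Record<Submission["weightDtype"], number> = {
      fp16: 2,
      q8: 1,
      q4: 0.5,
    };

    function fitsInVram(s: Submission): boolean {
      // Weights-only footprint, ignoring KV cache, activations, and CUDA context.
      const weightsGb = s.modelParamsB * BYTES_PER_WEIGHT[s.weightDtype];
      return weightsGb <= s.totalVramGb;
    }

    // Mistral-Large-123B at FP16 on one 24 GB RTX 3090: ~246 GB of weights needed.
    console.log(fitsInVram({ modelParamsB: 123, weightDtype: "fp16", totalVramGb: 24 })); // false

Flagging for review rather than hard-rejecting is probably safer, since CPU offload can make some of these configurations technically runnable, just nowhere near the claimed speed.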
This problem was already solved 10 years ago - crypto mining motherboards, which have a large number of PCIe slots, a CPU socket, one memory slot, and not much else.
> Asus made a crypto-mining motherboard that supports up to 20 GPUs
https://www.theverge.com/2018/5/30/17408610/asus-crypto-mini...
For LLMs you'll probably want a slightly different setup, with some memory and some M.2 storage too.
Those only gave each GPU a single PCIe lane though, since crypto mining barely needed to move any data around. If your application doesn't fit that mould then you'll need a much, much more expensive platform.
After you load the weights into the GPU and keep the KV cache there too, you don't need any other significant traffic.
1 reply →
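Some back-of-the-envelope numbers behind that claim (a TypeScript sketch with assumed figures: a ~40 GB quantized model, ~1 GB/s usable on PCIe 3.0 x1, ~31 GB/s on PCIe 4.0 x16): the one-time weight load is the only place the narrow link really hurts.

    // Rough PCIe traffic estimate; bandwidths are approximate usable figures.
    const pcie3x1GBps = 1;      // one PCIe 3.0 lane
    const pcie4x16GBps = 31;    // a full PCIe 4.0 x16 slot
    const weightsGB = 40;       // e.g. a ~70B model at ~4 bits/weight

    console.log(`load over x1 : ~${Math.round(weightsGB / pcie3x1GBps)} s`);   // ~40 s, once
    console.log(`load over x16: ~${Math.round(weightsGB / pcie4x16GBps)} s`);  // ~1 s, once

    // After that, decode traffic over the host link is mostly token IDs in and
    // sampled tokens out: bytes to kilobytes per step, nowhere near saturating x1.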
In theory, it's only sufficient for pipeline parallelism, due to the limited lanes and interconnect bandwidth.
Generally, scalability on consumer GPUs falls off somewhere between 4 and 8 GPUs. Those running more than that are typically using a larger number of smaller GPUs for cost-effectiveness.
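Rough numbers on why that split matters (a TypeScript sketch assuming a 70B-class model with hidden size ~8192, ~80 layers, FP16 activations, and batch size 1; none of these figures come from the thread):

    // Per-token inter-GPU traffic, pipeline vs. tensor parallel (rough estimate).
    const hidden = 8192;                              // assumed hidden size
    const layers = 80;                                // assumed layer count
    const fp16Bytes = 2;

    const ppBytesPerToken = hidden * fp16Bytes;              // ~16 KB per stage boundary
    const tpBytesPerToken = layers * 2 * hidden * fp16Bytes; // ~2.5 MB of all-reduce payload

    // At ~20 tok/s, pipeline parallel needs ~320 KB/s per link, trivial even for x1.
    // Tensor parallel moves ~50 MB/s per GPU and does ~160 synchronizations per token,
    // so both the latency and the bandwidth of x1 links become the bottleneck.
    console.log(ppBytesPerToken / 1024, tpBytesPerToken / (1024 * 1024));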
M.2 is mostly just a different form factor for PCIe anyway.
There is a whole section in here on how to spec out a cheap rig and what to look for:
* https://jabberjabberjabber.github.io/Local-AI-Guide/
We're not yet at the point where a single PCIe device will get you anything meaningful; IMO 128 GB of RAM available to the GPU is essential.
So while you don't need a ton of compute on the CPU, you do need the ability to address multiple PCIe lanes. A relatively low-spec AMD EPYC processor is fine if the motherboard exposes enough lanes.
There is plenty that can run within 32/64/96 GB of VRAM. IMO models like Phi-4 are underrated for many simple tasks, and some quantized Gemma 3 models are quite good as well.
There are larger/better models too, but those tend to really push the limits of 96 GB.
FWIW, once you start pushing into 128 GB+, the ~500 GB models really start to become attractive, because at that point you're probably wanting just a bit more out of everything.
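A rough rule of thumb behind those VRAM tiers (weights only; KV cache and runtime overhead come on top): size in GB is about parameters in billions times bits per weight, divided by 8. A minimal TypeScript sketch under that assumption:

    // Weights-only size estimate; real memory use is higher with KV cache and overhead.
    const weightsGb = (paramsB: number, bitsPerWeight: number) => (paramsB * bitsPerWeight) / 8;

    console.log(weightsGb(14, 8));  // Phi-4 at 8-bit:       ~14 GB, fits in 24 GB
    console.log(weightsGb(27, 4));  // Gemma 3 27B at 4-bit: ~13.5 GB
    console.log(weightsGb(70, 4));  // a 70B at 4-bit:       ~35 GB, wants 48 GB+
    console.log(weightsGb(70, 8));  // a 70B at 8-bit:       ~70 GB, 96 GB territory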
IDK, all of my personal and professional projects involve pushing the SOTA to the absolute limit. Using anything other than the latest OpenAI or Anthropic model is out of the question.
Smaller open-source models are a bit like 3D printing in the early days: fun to experiment with, but really not that valuable for anything other than making toys.
Text summarization, maybe? But even then I want a model that understands the complete context and does a good job. Even for things like "generate one sentence about the action we're performing", I usually find I can just fold that into the output schema of a larger request instead of making a separate call to a smaller model.
6 replies →
I'm holding out for someone to ship a GPU with DIMM slots on it.
DDR5 is one to two orders of magnitude slower than really good VRAM. That's one big reason.
7 replies →
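For scale, each generated token has to stream essentially all of the active weights through the memory bus, so decode speed is bounded by roughly bandwidth divided by model size. A TypeScript sketch using approximate spec-sheet bandwidths and an assumed ~40 GB model (illustrative ceilings, not benchmarks):

    // Upper-bound decode speed ~ memory bandwidth / bytes of weights read per token.
    // Bandwidth figures are approximate spec-sheet numbers, not measurements.
    const bandwidthGBps: Record<string, number> = {
      "DDR5-5600, dual channel": 90,
      "RTX 3090 GDDR6X": 936,
      "M3 Ultra unified memory": 819,
    };
    const modelGB = 40; // e.g. a ~70B model at ~4 bits/weight

    for (const [name, bw] of Object.entries(bandwidthGBps)) {
      console.log(`${name}: <= ~${(bw / modelGB).toFixed(1)} tok/s`);
    }
    // Roughly 2 vs 23 vs 20 tok/s: ceilings only, real numbers come in lower.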
Would that be worth anything, though? What about the overhead of the clock cycles needed for loading from and storing to RAM? It might not amount to a net performance benefit, and I bet it could also complicate heat management.
A single CAMM might suit better.
And you don't want to go the M4 Max/M3 Ultra route? It works well enough for most mid-sized LLMs.
Get the DGX Spark computers? They’re exactly what you’re trying to build.
They’re very slow.
They're okay, generally, but slow for the price. You're paying more for the ConnectX-7 networking than for inference performance.
1 reply →