Comment by dist-epoch

4 days ago

This problem was already solved 10 years ago - crypto mining motherboards, which have a large number of PCIe slots, a CPU socket, one memory slot, and not much else.

> Asus made a crypto-mining motherboard that supports up to 20 GPUs

https://www.theverge.com/2018/5/30/17408610/asus-crypto-mini...

For LLMs you'll probably want a different setup, with some memory and some M.2 storage too.

Those only gave each GPU a single PCIe lane though, since crypto mining barely needed to move any data around. If your application doesn't fit that mould then you'll need a much, much more expensive platform.
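
To put a rough number on the single-lane point, here is a back-of-the-envelope sketch of how long it takes to push data to one GPU over x1 versus x16. The per-lane figures are theoretical one-direction rates after encoding overhead, real throughput will be lower, and the 24 GB payload and the choice of PCIe 3.0 are assumptions for illustration, not details of any specific mining board.

```python
# Approximate theoretical bandwidth per PCIe lane, one direction, in GB/s.
# Real-world throughput is lower because of protocol overhead.
PCIE_GB_PER_S_PER_LANE = {2: 0.5, 3: 0.985, 4: 1.969}


def transfer_seconds(size_gb: float, pcie_gen: int, lanes: int) -> float:
    """Best-case time to move size_gb gigabytes over a PCIe link."""
    return size_gb / (PCIE_GB_PER_S_PER_LANE[pcie_gen] * lanes)


if __name__ == "__main__":
    payload_gb = 24.0  # hypothetical: enough data to fill a 24 GB card
    for lanes in (1, 16):
        seconds = transfer_seconds(payload_gb, pcie_gen=3, lanes=lanes)
        print(f"PCIe 3.0 x{lanes}: {seconds:.1f} s for {payload_gb:.0f} GB")
```

A one-time load at x1 is slow but tolerable; the real question is how much traffic the workload generates after that.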

  • After you load the weights into the GPU and keep the KV cache there too, you don't need any other significant traffic.

    • Even in tensor parallel modes? I thought it could only work if you're fine stalling all but n GPUs for n users at any given moment.

In theory, it’s only sufficient for pipeline parallelism, due to limited lanes and interconnect bandwidth.
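
A rough way to see the difference is to count bytes moved per generated token. The sketch below assumes Megatron-style tensor parallelism (two all-reduces per transformer layer in the forward pass) implemented as a ring all-reduce, and a pipeline that passes one hidden-state vector across each stage boundary. The model dimensions are hypothetical, and per-transfer latency, which hurts tensor parallelism further, is ignored.

```python
def pipeline_bytes_per_token(hidden_size: int, bytes_per_value: int) -> int:
    """Bytes crossing ONE pipeline stage boundary per token:
    a single hidden-state vector."""
    return hidden_size * bytes_per_value


def tensor_parallel_bytes_per_token(
    hidden_size: int, bytes_per_value: int, num_layers: int, num_gpus: int
) -> float:
    """Bytes each GPU sends per token, assuming two ring all-reduces
    per layer (one after attention, one after the MLP)."""
    activation = hidden_size * bytes_per_value
    # A ring all-reduce makes each GPU send 2 * (n - 1) / n times the buffer.
    ring_factor = 2 * (num_gpus - 1) / num_gpus
    return 2 * num_layers * ring_factor * activation


if __name__ == "__main__":
    # Hypothetical model: hidden size 8192, 80 layers, fp16, 8 GPUs.
    pp = pipeline_bytes_per_token(8192, 2)
    tp = tensor_parallel_bytes_per_token(8192, 2, num_layers=80, num_gpus=8)
    print(f"pipeline, per stage boundary: {pp / 1024:.0f} KiB per token")
    print(f"tensor parallel, per GPU:     {tp / 1e6:.2f} MB per token")
```

Under these assumptions tensor parallelism moves a few hundred times more data per token than a pipeline boundary does, and does so in 160 separate synchronizations, which is why a single lane per GPU is a much tighter constraint for it.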

Generally, scalability on consumer GPUs falls off somewhere between 4 and 8 GPUs for most setups. Those running more GPUs are typically using a larger number of smaller GPUs for cost-effectiveness.