
Comment by plagiarist

5 days ago

Could I get your thoughts on the Asus GX10 vs. spending on GPU compute? It seems like one could get a lot of total VRAM with better memory bandwidth and make PCIe the bottleneck. Especially if you already have a motherboard with spare slots.

I'm trying to better understand the trade-offs, or whether it depends on the workload.

Run a model at all, run a model fast, run a model cheap. Pick 2.

With LLM workloads, you can run some of the larger local models (at all), and you can run them cheap, on the 128GB unified-memory machines (Strix Halo/Spark). gpt-oss-120b, for example, will be pretty quick at 4-bit quantization: it's an MoE natively trained at MXFP4 with a small active parameter count. Other MoEs with highly compressed active parameter counts will be quick as well, but things get sluggish as the active parameters increase. The best way to run these models is a multi-GPU rig so you get speed and VRAM density at once, but that's expensive.
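To see why active parameter count dominates decode speed on these boxes, here's a back-of-envelope sketch (the model sizes and bandwidth figures are assumptions for illustration, not benchmarks):

```python
# Rough decode-speed ceiling for a bandwidth-bound MoE.
# All numbers are illustrative assumptions, not measurements.

def tokens_per_sec(active_params_b: float, bits_per_weight: float,
                   mem_bandwidth_gbs: float) -> float:
    """Every active weight is read once per token, so decode
    throughput is capped at bandwidth / bytes-per-token."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return mem_bandwidth_gbs * 1e9 / bytes_per_token

# ~5B active params at 4-bit (gpt-oss-120b ballpark, assumed) on
# ~256 GB/s of unified memory (Strix Halo-class, assumed):
print(f"{tokens_per_sec(5, 4, 256):.0f} tok/s ceiling")   # ~102
# A hypothetical model with 70B active params at 4-bit, same box:
print(f"{tokens_per_sec(70, 4, 256):.0f} tok/s ceiling")  # ~7
```

Real throughput lands below the ceiling (attention, KV cache reads, overhead), but the scaling with active parameter count is the point.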

With other workloads such as image/video generation, the unified VRAM doesn't help as much, and the operations themselves intrinsically run better on beefier GPU cores. Many of these models are relatively small compared to LLMs (roughly 6B-20B parameters), but generating from those parameters is heavily compute-intensive. So you get far more from a 3090 (maybe even a slightly lesser card) than you do from a unified memory rig.
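A crude arithmetic-intensity sketch shows why the two workloads stress different resources (toy numbers and simplified FLOP counts, not a real roofline analysis):

```python
# Toy arithmetic-intensity comparison: FLOPs per byte of weights moved.
# Simplified model: ~2 FLOPs per weight per token, fp16 weights read once.

def llm_decode_intensity(params_b: float) -> float:
    # Single-token decode: each weight does ~2 FLOPs and is read once,
    # so intensity is ~1 FLOP/byte no matter the model size.
    flops = 2 * params_b * 1e9
    bytes_moved = params_b * 1e9 * 2  # fp16
    return flops / bytes_moved

def diffusion_step_intensity(params_b: float, latent_tokens: int) -> float:
    # A denoising step pushes thousands of latent tokens through the
    # model at once, amortizing each weight read across all of them.
    flops = 2 * params_b * 1e9 * latent_tokens
    bytes_moved = params_b * 1e9 * 2
    return flops / bytes_moved

print(llm_decode_intensity(5))            # ~1 FLOP/byte: bandwidth-bound
print(diffusion_step_intensity(8, 4096))  # ~4096 FLOPs/byte: compute-bound
```

Bandwidth-bound work rewards the big unified-memory pool; compute-bound work rewards raw GPU cores, which is why the 3090 wins here.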

If you're running a mix of LLM and image/video generation workloads, there is no easy answer. Some folks on a budget opt for a unified memory machine with an eGPU to get the best of both worlds, but I hear drivers are an issue. Some folks use Mac Studios, which are quite fast but force you into the Metal ecosystem rather than CUDA and aren't as pleasant for development or tooling. Some folks build a multi-CPU server rig with a ton of vanilla RAM (popular with people who wanted to run DeepSeek before RAM prices spiked). Some folks buy older servers with VRAM-dense but dated cards (think Pascal, Volta, etc., or AMD MI50/MI100). There's no free lunch with any of these options, honestly.

If you don't have a very clear sense of something you can buy that you won't regret, it's hard to go wrong renting from the cloud GPU providers (Runpod, Modal, Northflank, etc.) or using something like Fal or Replicate, where you can try out the open source models and pay per request. Sure, you'll spend a bit more on unit costs, but it'll force you to figure out whether your workloads are settled enough that the pain of running them in the cloud stings enough to make you want to buy and own the metal. If the answer is no, even if you could afford it, you'll often be happiest just using the right cloud service!
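If you want to put numbers on the rent-vs-buy question, a toy break-even calculation looks like this (made-up example prices; it ignores resale value, depreciation, and your time):

```python
# Toy rent-vs-buy break-even. All prices are made-up examples.

def breakeven_hours(hw_cost: float, cloud_per_hr: float,
                    power_kw: float = 0.4,
                    elec_per_kwh: float = 0.15) -> float:
    """Hours of actual GPU use before owning beats renting."""
    local_per_hr = power_kw * elec_per_kwh  # electricity only
    return hw_cost / (cloud_per_hr - local_per_hr)

# e.g. a $2,500 used GPU rig vs. a $0.70/hr cloud instance:
hrs = breakeven_hours(2500, 0.70)
print(f"~{hrs:.0f} busy hours (~{hrs / (4 * 365):.1f} years at 4 hrs/day)")
```

If your utilization is bursty, the break-even horizon stretches out fast, which is the point of the advice above.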

Ask me how I figured out all of the above the hard way...

  • Thank you for sharing your painful learning experiences! What do you have now?

    • Locally, I use a Mac Studio with a ton of VRAM, and I just accept the limitations of the Metal ecosystem, which is generally fine for the inference workloads I run consistently (but which I think would be a pain for a lot of people).

      I can't see it making sense for training workloads if and when I get to those (I'd put them on the cloud). I have a box with a single 3090 for CUDA dev if I need it, but I haven't needed it that often. And frankly, the Mac Studio has rough computational parity with (a bit under) a 3090 in terms of grunt, but with an order of magnitude more unified VRAM, so it hits the mark for the medium-ish MoE models I like to run locally as well as some of the diffusion inference workloads.

      Anything that doesn't work great locally, or that's throwaway but needs to be fast, ends up getting thrown at the cloud. Once I'm running something over and over on a recurring basis, I pull it back to something I can run locally.

It depends entirely on what you want to do, and how much you're willing to deal with a hardware setup that requires a lot of configuration. Buying several 3090s can be powerful. Buying one or two 5090s can be awesome, from what I've heard.