
Comment by princehonest

16 hours ago

Let's say you had a hardware budget of $5,000. What machine would you buy or build to run Devstral Small 2? The HuggingFace page claims it can run on a Mac with 32 GB of memory or an RTX 4090. What kind of tokens per second would you get on each? What about DGX Spark? What about RTX 5090 or Pro series? What about external GPUs on Oculink with a mini PC?

All those choices seem to have very different trade-offs. I hate $5,000 as a budget: not enough to launch you into the higher-VRAM RTX Pro cards, too much (for me personally) to just spend on a "learning/experimental" system.

I've personally decided to just rent systems with GPUs from a cloud provider and set up SSH tunnels to my local system. I mean, if I were doing some more HPC/numerical programming (say, similarity search on GPUs :-) ), I could see just taking the hit and spending $15,000 on a workstation with an RTX Pro 6000.

For grins:

Max t/s for this and smaller models? RTX 5090 system. Barely squeezes in at $5,000 today, and given RAM prices, maybe not actually possible tomorrow.

Max CUDA compatibility, slower t/s? DGX Spark.

OK with slower t/s, don't care so much about CUDA, and want to run larger models? Strix Halo system with 128 GB of unified memory; order a Framework Desktop.

Prefer Macs, might run larger models? M3 Ultra with memory maxed out. Better memory bandwidth, and Mac users seem quite happy running locally for just messing around. (Rough bandwidth math below.)
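A rough way to compare those options: at batch size 1, decode speed is memory-bandwidth bound, since every generated token has to stream the whole model out of memory, so tokens/s tops out around bandwidth divided by model size. The sketch below uses ballpark bandwidth figures and assumes a ~14 GB 4-bit quant of a 24B model; real numbers will land well below these ceilings.

    # Rough ceiling on single-stream decode speed: each generated token has to
    # stream the whole (quantized) model out of memory, so t/s <= bandwidth / size.
    # Bandwidth and size figures are ballpark assumptions, not benchmarks.
    MODEL_GB = 14  # ~24B-parameter model at 4-bit quantization

    bandwidth_gb_per_s = {
        "RTX 4090": 1008,
        "RTX 5090": 1792,
        "DGX Spark": 273,
        "Strix Halo (128 GB)": 256,
        "M3 Ultra": 819,
    }

    for name, bw in bandwidth_gb_per_s.items():
        print(f"{name:20} ~{bw / MODEL_GB:4.0f} tok/s theoretical max")

Prompt processing (prefill) is compute-bound rather than bandwidth-bound, so the ranking there looks different; that's where the discrete Nvidia cards pull further ahead.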

You'll probably find better answers heading off to https://www.reddit.com/r/LocalLLaMA/ for actual benchmarks.

  • > I've personally decided to just rent systems with GPUs from a cloud provider and set up SSH tunnels to my local system.

    That's a good idea!

    Curious about this, if you don't mind sharing:

    - what's the stack? (Do you run something like llama.cpp on that rented machine?)

    - what model(s) do you run there?

    - what's your rough monthly cost? (Does it come out much cheaper than calling the equivalent paid APIs?)

    • I ran Ollama first because it was easy, but now I download the source and build llama.cpp on the machine. I don't bother keeping a filesystem between runs on the rented machine; I just build llama.cpp every time I start up.

      I am usually just running gpt-oss-120b or one of the Qwen models. Sometimes Gemma? These are mostly "medium" sized in terms of memory requirements - I'm usually trying unquantized models that will easily run on a single 80-ish GB GPU because those are cheap.

      I tend to spend $10-$20 a week. But I am almost always prototyping or testing an idea for a specific project that doesn't require me to run 8 hrs/day. I don't use the paid APIs for several reasons but cost-effectiveness is not one of those reasons.
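      For anyone curious what the local side of that looks like: a minimal sketch, assuming llama-server (llama.cpp's HTTP server) is listening on port 8080 on the rented box and the port is forwarded home over SSH. llama-server speaks an OpenAI-style chat API, so the client is just a plain HTTP call:

        # Minimal client for llama.cpp's llama-server reached over an SSH tunnel.
        # Assumes the tunnel is already up, e.g.:
        #   ssh -L 8080:localhost:8080 user@rented-gpu-box
        # and that llama-server was started on the remote machine with a model loaded.
        import json
        import urllib.request

        payload = {
            "messages": [{"role": "user", "content": "Summarize what an SSH tunnel does."}],
            "max_tokens": 128,
        }
        req = urllib.request.Request(
            "http://localhost:8080/v1/chat/completions",  # forwarded port, not a local GPU
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            print(json.load(resp)["choices"][0]["message"]["content"])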


Dual 3090s (24 GB each) on 8x+8x PCIe has been a really reliable setup for me (with an NVLink bridge... even though it's relatively low bandwidth compared to Tesla NVLink, it's better than going over PCIe!)

48 GB of VRAM and lots of CUDA cores; hard to beat this value atm.
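To actually use the two cards as one 48 GB pool, you shard the model across them. llama.cpp has --tensor-split for this; a minimal sketch of the same idea with Hugging Face transformers + accelerate (the model ID is a placeholder, not necessarily what the parent runs):

    # Sketch: shard one model across two 24 GB GPUs with layer-wise placement.
    # Requires: pip install torch transformers accelerate
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "mistralai/Devstral-Small-2507"  # placeholder; substitute whatever you run

    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # ~2 bytes/param; a 24B model is ~47 GB of weights,
                                     # so in practice you'd likely load a quantized build
        device_map="auto",           # accelerate spreads layers over cuda:0 and cuda:1
    )

    prompt = tok("def quicksort(xs):", return_tensors="pt").to(model.device)
    out = model.generate(**prompt, max_new_tokens=64)
    print(tok.decode(out[0], skip_special_tokens=True))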

If you want to go even further, you can get an 8x V100 32 GB server complete with 512 GB of RAM and NVLink switching for $7,000 USD from unixsurplus (ebay.com/itm/146589457908), which can run even bigger models with healthy throughput. You would need 240V power to run that in a home lab environment, though.

  • The V100 is outdated (no bf16, dropped in CUDA 13) and power hungry (8 cards over 3 years of continuous use is about $12k of electricity).
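    Back-of-the-envelope version of that figure, with the wattage and electricity price as assumptions (and ignoring the rest of the server):

      # Rough electricity cost for 8x V100 running 24/7 for 3 years.
      # Every input is an assumption; the result scales linearly with each of them.
      cards = 8
      watts_each = 300           # SXM2 V100 TDP; ignores CPUs, fans, PSU losses
      hours = 3 * 365 * 24
      usd_per_kwh = 0.19         # pick your local rate

      kwh = cards * watts_each * hours / 1000
      print(f"{kwh:,.0f} kWh -> ${kwh * usd_per_kwh:,.0f}")   # ~63,000 kWh -> ~$12,000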

I'd throw a 7900 XTX in an AM4 rig with 128 GB of DDR4 (which is what I've been using for the past two years).

Fuck nvidia

  • You know, I haven't even been thinking about those AMD GPUs for local LLMs, and it's clearly a blind spot for me.

    How is it? I'd guess a bunch of the MoE models actually run well?

    • I've been running local models on an AMD 7800 XT with ollama-rocm and have had zero technical issues. It's really just that the usefulness of a model with only 16 GB of VRAM + 64 GB of main RAM is questionable, but that isn't an AMD-specific issue. It was a similar experience running locally with an Nvidia card.
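      If you ever want to sanity-check the ROCm side outside of Ollama, a ROCm build of PyTorch exposes the card through the usual torch.cuda API; a quick check, assuming torch was installed from the ROCm wheels:

        # Quick sanity check that a ROCm build of PyTorch sees the Radeon card.
        # Assumes torch was installed from the ROCm wheel index, not the default CUDA one.
        import torch

        print("HIP/ROCm version:", torch.version.hip)     # None on CUDA-only builds
        print("GPU visible:", torch.cuda.is_available())  # ROCm reuses the torch.cuda API
        if torch.cuda.is_available():
            print("Device:", torch.cuda.get_device_name(0))
            free, total = torch.cuda.mem_get_info()
            print(f"VRAM: {free / 1e9:.0f} GB free of {total / 1e9:.0f} GB")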