
Comment by reilly3000

12 days ago

dang I wish I could share md tables.

Here’s a text edition: For $50k the inference hardware market forces a trade-off between capacity and throughput:

* Apple M3 Ultra Cluster ($50k): Maximizes capacity (3TB). It is the only option in this price class capable of running 3T+ parameter models (e.g., Kimi k2), albeit at low speeds (~15 t/s).

* NVIDIA RTX 6000 Workstation ($50k): Maximizes throughput (>80 t/s). It is superior for training and inference but is hard-capped at 384GB VRAM, restricting model size to <400B parameters.

To achieve both high capacity (3TB) and high throughput (>100 t/s) requires a ~$270,000 NVIDIA GH200 cluster and data center infrastructure. The Apple cluster provides 87% of that capacity for 18% of the cost.
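
Quick sanity check on those ratios in Python; the GH200 cluster capacity implied by the 87% figure (roughly six ~576GB nodes) is my assumption, not a quoted spec:

  # Back-of-envelope version of the trade-off above. The ~3.4TB GH200
  # cluster capacity (6 nodes x 576GB) is assumed to match the 87% figure.
  apple = {"cost": 50_000, "capacity_tb": 3.0}
  gh200 = {"cost": 270_000, "capacity_tb": 6 * 576 / 1000}

  cap_ratio = apple["capacity_tb"] / gh200["capacity_tb"]
  cost_ratio = apple["cost"] / gh200["cost"]
  print(f"capacity: {cap_ratio:.0%} of the GH200 cluster")  # ~87%
  print(f"cost: {cost_ratio:.0%} of the GH200 cluster")     # ~19%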

You can keep scaling down! I spent $2k on an old dual-socket Xeon workstation with 768GB of RAM - I can run DeepSeek-R1 at ~1-2 tokens/sec.

  • I did the same, then put in 14 3090s. It's a little power-hungry, but fairly impressive performance-wise. The hardest parts were power distribution and riser cards, but I found good solutions for both.

    • You get occasional accounts of 3090 home-superscalers where they put up eight, ten, fourteen cards. I normally attribute this to obsessive-compulsive behaviour. What kind of motherboard did you end up using, and what bi-directional bandwidth are you seeing? Something tells me you're not using EPYC 9005s with 128 PCIe 5.0 lanes per socket or anything like that... Also: I find the "performance" claims hard to believe when your rig is pulling ~3 kW from the wall (assuming undervolting to 200W per card?). The electricity costs alone would surely make this intractable; it's like running six washing machines all at once.
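
      Rough math behind that 3 kW claim, as a Python sketch (the 200W undervolt and a $0.30/kWh rate are assumptions, not measurements):

        # Electricity cost for a 14x 3090 rig, assuming 200W per
        # undervolted card plus ~200W for the host, at an assumed $0.30/kWh.
        cards, watts_per_card, host_watts = 14, 200, 200
        kw = (cards * watts_per_card + host_watts) / 1000   # ~3.0 kW

        rate = 0.30  # $/kWh, assumed
        print(f"{kw:.1f} kW -> ${kw * rate:.2f}/hour, "
              f"${kw * rate * 720:.0f}/month if running 24/7")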


  • And if you get bored of that, you can flip the RAM for more than you spent on the whole system!

  • Nice! What do you use it for?

    • 1-2 tokens/sec is perfectly fine for 'asynchronous' queries, and the open-weight models are pretty close to frontier quality (maybe a few months behind?). I frequently use it for a variety of research topics, feasibility studies for wacky ideas, and some prototype-y coding tasks. I usually give it a prompt and come back half an hour later to see the results (although the thinking traces are sufficiently entertaining that sometimes it's fun to just read them as they come out). Being able to see the full thinking traces (and pause and alter/correct them if needed) is one of my favorite aspects of running these models locally. The thinking traces are frequently just as useful as the final outputs, or more so.
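
      That workflow is easy to script against any local OpenAI-compatible endpoint; a minimal sketch, assuming a llama.cpp/vLLM-style server on localhost:8080 (URL and model name are placeholders):

        # "Ask now, read later" against a local OpenAI-compatible server.
        # URL, port and model name are assumptions; match your setup.
        import requests

        resp = requests.post(
            "http://localhost:8080/v1/chat/completions",
            json={
                "model": "deepseek-r1",  # whatever your server registered
                "messages": [{"role": "user", "content": "Feasibility of ..."}],
                "max_tokens": 4096,
            },
            timeout=3600,  # at 1-2 t/s, a long answer takes a while
        )
        print(resp.json()["choices"][0]["message"]["content"])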

For $50K, you could buy 25 Framework desktop motherboards (128GB VRAM each w/Strix Halo, so over 3TB total). Not sure how you'd cluster all of them, but it might be fun to try. ;)

  • There is no way to achieve a high-throughput, low-latency connection between 25 Strix Halo systems. After accounting for storage and networking, there are barely any PCIe lanes left to link two of them together.

    You might be able to use USB4, but I'm unsure how the latency is for that.

    • In general I agree with you (the IO options exposed by Strix Halo are pretty limited), but if we're getting technical, you can tunnel PCIe over USB4v2 per the spec, in a way that's functionally similar to Thunderbolt 5. That gives you essentially 3 sets of native PCIe4x4 from the chipset and an additional 2 sets tunnelled over USB4v2. TB5 and USB4 controllers are not all made equal, so in practice YMMV. Regardless of USB4v2 or TB5, you'll take a minor latency hit.

      Strix Halo IO topology: https://www.techpowerup.com/cpu-specs/ryzen-ai-max-395.c3994

      Framework's mainboard implements 2 of those PCIe4x4 GPP interfaces as M.2 PHYs, which you can connect a standard PCIe AIC (like a NIC or DPU) to with a passive adapter. Interestingly, it also exposes the 3rd x4 GPP as a standard x4-length PCIe CEM slot, though the system/case isn't compatible with actually installing a standard PCIe add-in card there without getting hacky, especially as it's not an open-ended slot.

      You absolutely could slap 1x SSD in there for local storage, then attach up to 4x RDMA-supporting NICs to a RoCE-enabled switch (or InfiniBand if you're feeling special) to build out a Strix Halo cluster (and you could do similar with Mac Studios, to be fair). You could get really extra by using a DPU/SmartNIC that lets you boot from an NVMe-oF SAN, leveraging all 5 sets of PCIe4x4 for connectivity without any local storage, but that crosses a complexity/cost threshold I doubt most people want to cross. And those willing to cross it would also be looking at other solutions better suited to the job that don't require as many workarounds.

      Apple's solution is better for a small cluster, both in pure connectivity terms and with respect to its memory advantages, but Strix Halo is doable. In both cases, though, scaling beyond 3 or especially 4 nodes rapidly puts you in complexity and cost territory that is better served by less restrictive nodes, unless you have some very niche reason to use either Macs (especially non-Pro) or Strix Halo specifically.
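
      Putting rough per-node numbers on that connectivity, as a sketch (the usable fraction of a USB4v2 tunnel is a guess; real overhead varies by controller):

        # Aggregate per-direction bandwidth if all 5 PCIe4x4 sets carry
        # NICs: 3 native links + 2 tunnelled over USB4v2 (80% usable is
        # a guess, not a measurement).
        pcie4_x4 = 4 * 16 * (128 / 130)  # ~63 Gb/s per direction per link
        usb4v2_tunnel = 80 * 0.8         # assumed usable share of 80 Gb/s

        total_gbps = 3 * pcie4_x4 + 2 * usb4v2_tunnel
        print(f"~{total_gbps:.0f} Gb/s (~{total_gbps / 8:.0f} GB/s) per node")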

    • Do they need fast storage in this application? Their OS could be on some old SATA drive or whatever. The whole goal is to get them onto a fast network together; the models could be stored on a network filesystem as well, right?


What's the math on the $50k NVIDIA cluster? My understanding is these things cost ~$8k each, so you can get at least 5 for $40k; that's around half a TB.

That being said, for inference Macs still remain the best, and the M5 Ultra will be an even better value with its better prompt processing (PP).

  • GPUs: 4x NVIDIA RTX 6000 Blackwell (96GB VRAM each) • Cost: 4 × $9,000 = $36,000

    CPU: AMD Ryzen Threadripper PRO 7995WX (96-core) • Cost: $10,000

    Motherboard: WRX90 chipset (supports 7x PCIe Gen5 slots) • Cost: $1,200

    RAM: 512GB DDR5 ECC Registered • Cost: $2,000

    Chassis & Power: Supermicro or specialized workstation case + 2x 1600W PSUs • Cost: $1,500

    Total: ~$50,700

    It's a bit maximalist, but if you had to spend $50k, it's going to be about as fast as you can make it.
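
    A quick tally of that list in Python (same ballpark prices as above):

      # Tallying the build above; prices are the quoted ballpark figures.
      parts = {
          "4x RTX 6000 Blackwell (96GB)": 4 * 9_000,
          "Threadripper PRO 7995WX": 10_000,
          "WRX90 motherboard": 1_200,
          "512GB DDR5 ECC RDIMM": 2_000,
          "chassis + 2x 1600W PSUs": 1_500,
      }
      print(f"total: ${sum(parts.values()):,}")  # $50,700
      print(f"VRAM: {4 * 96} GB")                # the 384GB cap noted above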

Are you factoring in the above comment about the as-yet-unimplemented parallel speed-up? For on-prem inference without any kind of ASIC, this seems quite a bargain, relatively speaking.

Apple deploys LPDDR5X for energy efficiency and cost (lower is better), whereas NVIDIA will always prefer GDDR and HBM for performance and cost (higher is better).

  • The GH/GB compute has LPDDR5X as well: a single or dual GPU shares 480GB (depending on whether it's GH or GB) in addition to the HBM memory, with NVLink C2C. It's not bad!

    • Essentially, the Grace CPU is a memory and IO expander that happens to have a bunch of ARM CPU cores filling in the interior of the die, while the perimeter is all PHYs for LPDDR5 and NVLink and PCIe.
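
      Roughly, per Grace Hopper node (a sketch from public specs; bandwidth figures are approximate):

        # Approximate memory pools on a GH200 node, per the comment above.
        pools = {
            "Grace LPDDR5X": (480, 500),  # GB, ~GB/s
            "Hopper HBM3": (96, 4000),    # a 144GB HBM3e variant also exists
        }
        nvlink_c2c = 900  # GB/s total for the coherent CPU<->GPU link

        total_gb = sum(cap for cap, _bw in pools.values())
        print(f"coherent memory per node: ~{total_gb} GB")  # ~576 GB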


15 t/s is way too slow for anything but chatting (call and response), and you don't need a 3T-parameter model for that.

Wake me up when the situation improves

What about a GB300 workstation with 784GB of unified memory?

  • That thing will be extremely expensive, I'd guess. And neither the CPU nor the GPU has that much memory on its own. It's not a great workstation either - macOS is a lot more comfortable to use.

  • $95K

    • I miss the time when you could go to Apple's website and build the most obscene computer possible. With the M series, the options got a lot more limited. IIRC, an x86 Mac Pro with 1.5 TB of RAM, a big GPU, and the two accelerators would yield an eye-watering hardware bill.

      Now you need to add 8 $5K monitors to get something similarly ludicrous.

    • Do you have a source for that? I've been trying to find pricing information but haven't been successful yet.