Comment by Barathkanna

13 days ago

A realistic setup for this would be 16× H100 80GB with NVLink. That comfortably handles the ~32B active parameters plus KV cache without extreme quantization. Cost-wise we are looking at roughly $500k–$700k upfront or $40–60/hr on-demand, which makes it clear this model is aimed at serious infra teams, not casual single-GPU deployments. I’m curious how API providers will price tokens on top of that hardware reality.

16 comments


wongarsu 13 days ago

The weights are int4, so you'd only need 8xH100
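The arithmetic behind that claim can be sketched roughly as follows. This is a back-of-envelope estimate, not a sizing guide: it assumes about 1T total parameters (the "1T-scale" figure mentioned elsewhere in the thread), treats every weight as int4 (in practice some layers may be stored at higher precision), and ignores KV cache, activations, and framework overhead. The helper names are hypothetical.

```python
# Back-of-envelope check: do the weights fit in aggregate GPU memory?
# Assumptions (not from any model card): ~1T total parameters, every
# weight stored at the given precision, no KV cache or runtime overhead.

BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_gb(total_params: float, precision: str) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return total_params * BYTES_PER_PARAM[precision] / 1e9


def aggregate_hbm_gb(num_gpus: int, gb_per_gpu: float) -> float:
    """Total HBM across all GPUs, ignoring per-GPU overheads."""
    return num_gpus * gb_per_gpu


TOTAL_PARAMS = 1e12  # assumed "1T-scale" MoE

# Per-GPU capacities are the published HBM sizes for each part.
CONFIGS = {
    "8x H100 80GB": aggregate_hbm_gb(8, 80),
    "16x H100 80GB": aggregate_hbm_gb(16, 80),
    "8x H200 141GB": aggregate_hbm_gb(8, 141),
}

if __name__ == "__main__":
    need = weight_gb(TOTAL_PARAMS, "int4")
    for name, have in CONFIGS.items():
        headroom = have - need
        print(f"{name}: {have:.0f} GB total, {headroom:.0f} GB left after int4 weights")
```

Whatever headroom remains is what is left for KV cache and activations, which is why the smaller configuration fits but leaves less room for long contexts.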

a2128 13 days ago

You don't need to wait and see: Kimi K2 has the same hardware requirements and has several providers on OpenRouter:

https://openrouter.ai/moonshotai/kimi-k2-thinking
https://openrouter.ai/moonshotai/kimi-k2-0905
https://openrouter.ai/moonshotai/kimi-k2-0905:exacto
https://openrouter.ai/moonshotai/kimi-k2

Generally it seems to be in the neighborhood of $0.50 per 1M input tokens and $2.50 per 1M output tokens.
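To relate those token prices to the $40–60/hr on-demand figure quoted at the top of the thread, here is a minimal sketch of the break-even throughput a provider would need. It is a deliberate simplification: it counts only output-token revenue, assumes the node is busy 100% of the time, and ignores every cost other than the hourly rental. The function name is hypothetical.

```python
def breakeven_tokens_per_sec(node_usd_per_hour: float,
                             usd_per_million_tokens: float) -> float:
    """Sustained tokens/sec at which token revenue equals the hourly node cost.

    Simplifying assumptions: revenue comes from a single token price,
    utilization is 100%, and the node rental is the only cost.
    """
    tokens_per_hour = node_usd_per_hour / usd_per_million_tokens * 1_000_000
    return tokens_per_hour / 3600


if __name__ == "__main__":
    OUTPUT_PRICE = 2.50  # USD per 1M output tokens, from the comment above
    for hourly in (40.0, 60.0):  # USD/hr range quoted at the top of the thread
        rate = breakeven_tokens_per_sec(hourly, OUTPUT_PRICE)
        print(f"${hourly:.0f}/hr node: about {rate:,.0f} output tokens/sec to break even")
```

Under these assumptions the node has to sustain several thousand output tokens per second across all users, which is only plausible with heavy batching.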

reissbaker 13 days ago

Generally speaking, 8xH200s will be a lot cheaper than 16xH100s, and faster too. But both should technically work.

pama 13 days ago

You can do it, and it may be OK for a single user with idle waiting times, but performance/throughput will be roughly halved (closer to 2/3) and free context will be more limited with 8xH200 vs 16xH100 (assuming decent interconnect). Depending a bit on use case and workload, 16xH100 (or 16xB200) may be a better config for cost optimization. Often there is a huge economy of scale with such large mixture-of-experts models, so that it would even be cheaper to use 96 GPUs instead of just 8 or 16. The reasons are complicated and involve better prefill cache and less memory transfer per node.

bertili 13 days ago

The other realistic setup is $20k, for a small company that needs a private AI for coding or other internal agentic use, with two Mac Studios connected over Thunderbolt 5 RDMA.

Barathkanna 13 days ago
That won’t realistically work for this model. Even with only ~32B active params, a 1T-scale MoE still needs the full expert set available for fast routing, which means hundreds of GB to TBs of weights resident. Mac Studios don’t share unified memory across machines, Thunderbolt isn’t remotely comparable to NVLink for expert exchange, and bandwidth becomes the bottleneck immediately. You could maybe load fragments experimentally, but inference would be impractically slow and brittle. It’s a very different class of workload than private coding models.
- bertili 13 days ago
  
  People are running the previous Kimi K2 on 2 Mac Studios at 21 tokens/s or 4 Macs at 30 tokens/s. It's still premature, but not a completely crazy proposition for the near future, given the rate of progress.
  
- zozbot234 13 days ago
  
  If "fast" routing is per-token, the experts can just reside on SSDs. The performance is good enough these days. You don't need to globally share unified memory across the nodes; you'd just run distributed inference.
  Anyway, in the future your local model setups will just be downloading experts on the fly from experts-exchange. That site will become as important to AI as downloadmoreram.com.
- YetAnotherNick 13 days ago
  
  Depends on whether you are using tensor parallelism or pipeline parallelism. In the second case you don't need any sharing.
- omneity 13 days ago
  
  RDMA over Thunderbolt is a thing now.
embedding-shape 13 days ago
I'd love to see the prompt processing speed difference between 16× H100 and 2× Mac Studio.
- zozbot234 13 days ago
  
  Prompt processing/prefill can even get some speedup from local NPU use most likely: when you're ultimately limited by thermal/power limit throttling, having more efficient compute available means more headroom.
- Barathkanna 13 days ago
  
  I asked GPT for a rough estimate to benchmark prompt prefill on an 8,192-token input:
  • 16× H100: 8,192 / (20k to 80k tokens/sec) ≈ 0.10 to 0.41 s
  • 2× Mac Studio (M3 Max): 8,192 / (150 to 700 tokens/sec) ≈ 12 to 55 s
  These are order-of-magnitude numbers, but the takeaway is that multi-H100 boxes are plausibly ~100× faster than workstation Macs for this class of model, especially for long-context prefill.
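
The division in that estimate is easy to reproduce. The sketch below only restates the arithmetic: the throughput ranges are the unverified GPT-supplied guesses from the comment, not measurements, and the function name is hypothetical.

```python
def prefill_seconds(prompt_tokens: int, tokens_per_sec: float) -> float:
    """Time to process a prompt at a given sustained prefill throughput."""
    return prompt_tokens / tokens_per_sec


PROMPT_TOKENS = 8192

# (low, high) prefill throughput in tokens/sec. These are the rough guesses
# quoted in the comment, not benchmark results.
THROUGHPUT_RANGES = {
    "16x H100": (20_000, 80_000),
    "2x Mac Studio": (150, 700),
}

if __name__ == "__main__":
    for name, (low, high) in THROUGHPUT_RANGES.items():
        fastest = prefill_seconds(PROMPT_TOKENS, high)
        slowest = prefill_seconds(PROMPT_TOKENS, low)
        print(f"{name}: {fastest:.2f} to {slowest:.2f} s")

    # Speed ratio at the low and high ends of each range.
    h100_low, h100_high = THROUGHPUT_RANGES["16x H100"]
    mac_low, mac_high = THROUGHPUT_RANGES["2x Mac Studio"]
    print(f"ratio: {h100_high / mac_high:.0f}x to {h100_low / mac_low:.0f}x")
```

Swapping in measured throughput for either setup would turn this into an actual comparison rather than an order-of-magnitude guess.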
  
zozbot234 13 days ago

That's great for affordable local use, but it'll be slow: even with a proper multi-node inference setup, the Thunderbolt link will be a comparative bottleneck.