Comment by vadansky

1 day ago

Can I run something comparable to Opus 4.6 locally yet? I keep hearing conflicting things. If I can spend 10k to do that I would cancel my subscription. The problem is I don’t wanna spend the money to find out myself.

18 comments

Catloafdev 1 day ago

If you want frontier-level, the economically reasonable option is OpenRouter or a direct sub to frontier-of-your-choice.

The reality is that hardware vendors do not offer consumer configurations with that much VRAM in a single setup, in order to protect datacenter margins. Apple used to, but stopped; those devices now go for ~$20k+ each on eBay.

You can run very, very capable models on a 3090/4090/5090/6000 series card. But if you want 'frontier level', you are investing ~22k at a bare minimum if you buy new. Going used, you can probably build your own server for a much lower up-front cost, but it's likely going to use 4-6x+ the electricity.

daemonologist 1 day ago
There are also significant economies of scale (namely: utilization and batching), which tend to make inference on a shared server more economical even after the operator takes a cut.
- zozbot234 1 day ago
  
  You can use batching on consumer hardware, it just requires a KV-cache efficient model (or short context only) and keeping multiple inference flows running in parallel. This is most useful in combination with streamed inference, since the compute intensity of decode with those newer KV-compressed models is high enough that you have limited compute headroom when running at the speed of RAM.
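To make the batching argument concrete, here is a rough roofline-style sketch, not a benchmark. It assumes each decode step reads the weights once for the whole batch, reads each sequence's KV cache separately, and is limited by whichever of memory or compute is slower. Every number below is a hypothetical placeholder, not a measurement of any real model or machine.

```python
def decode_throughput(batch, weight_bytes, mem_bw, flops_per_token, compute,
                      kv_bytes_per_seq=0.0):
    """Aggregate tokens/s for one decode step under a simple roofline model."""
    # Weights are read once per step and shared by the batch;
    # KV-cache reads grow with the number of sequences.
    t_memory = (weight_bytes + batch * kv_bytes_per_seq) / mem_bw
    # Arithmetic grows with every sequence in the batch.
    t_compute = batch * flops_per_token / compute
    return batch / max(t_memory, t_compute)


# Hypothetical values, for illustration only.
WEIGHT_BYTES = 20e9      # bytes of weights touched per decode step
MEM_BW = 400e9           # memory bandwidth, bytes/s
FLOPS_PER_TOKEN = 80e9   # FLOPs per generated token
COMPUTE = 50e12          # sustained FLOP/s

if __name__ == "__main__":
    for b in (1, 4, 16, 32, 64):
        tps = decode_throughput(b, WEIGHT_BYTES, MEM_BW, FLOPS_PER_TOKEN, COMPUTE)
        print(f"batch={b:3d}  ~{tps:.0f} tok/s total")
```

In this toy model throughput scales almost linearly with batch size until the compute term takes over, after which extra parallel flows add nothing. Passing a large `kv_bytes_per_seq` shows why a KV-cache efficient model (or short context) matters: the per-sequence reads eat into the bandwidth that batching was supposed to amortize.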
theossuary 1 day ago
I truly think that by 2028 we'll have integrated chip systems able to run Opus 4.8 level models at ~500 watts with acceptable performance. Honestly I think now is the worst time to invest in AI hardware. Get your harness ready and your processes perfected with hosted models, then wait a few years to buy hardware and transition to running models locally.
- baq 1 day ago
  
  Burning weights onto a chip in an efficient way and exposing that via USB would be acceptable for a good enough model tbh
  
- hurtigioll 1 day ago
  
  If such hardware becomes available, it will be bought up by the datacenters, just like they buy all the RAM today.
- CamperBob2 1 day ago
  
  > Honestly I think now is the worst time to invest in AI hardware.
  
  That position is not without its own risks, though. Maybe Opus 4.8 will run on a single chip by 2028... and maybe you won't be allowed to touch it.
  And what if Xi makes a play for Taiwan? That would be stupid, but so was invading Ukraine with tanks from Temu, and it still happened.
  

grim_io 1 day ago

10k will not get you anywhere near Opus or Sonnet. It's simply not possible for mere mortals currently.

als0 1 day ago

> Can I run something comparable to Opus 4.6 locally yet?

Sadly, no. The closest you can get is roughly Sonnet 3.7 level.

captaintobs 1 day ago

I spent 8k and get something close to Sonnet, just 2-3x slower. I'm running DeepSeek V4 Flash on 2x Spark.

CamperBob2 1 day ago

Some benchmarks have shown Kimi K2.6 within error-bar distance of Opus 4.6, and you can run it on eight RTX6000s. Right now it's not possible to set up a machine like that from scratch for less than $100K... but right now it's also hard to put a price on autonomy.

zozbot234 1 day ago

You need a lot less than that if you're willing to stream the model from SSD. At that point, the best machine is probably a cheap old-gen HEDT with lots of PCIe lanes to attach cheap NVMe storage to, so as to stream the model at reasonable speed. That's expensive but not $100k expensive!
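A back-of-the-envelope way to sanity-check the SSD streaming idea: if the weights needed for each token have to come off NVMe, decode speed is capped at aggregate drive bandwidth divided by bytes read per token. The sketch below assumes a mixture-of-experts model where only the active parameters are read per token. The parameter count, drive count, per-drive bandwidth, and cache hit rate are all hypothetical placeholders, and the result is an upper bound that ignores compute and filesystem overhead.

```python
def streamed_tokens_per_second(active_params, bits_per_weight, drives,
                               gb_per_s_per_drive, cache_hit_rate=0.0):
    """Upper bound on decode speed when weights are streamed from NVMe."""
    if not 0.0 <= cache_hit_rate < 1.0:
        raise ValueError("cache_hit_rate must be in [0, 1)")
    # Bytes of weights needed to produce one token.
    bytes_per_token = active_params * bits_per_weight / 8
    # Only the part not already cached in RAM has to be read from SSD.
    bytes_from_ssd = bytes_per_token * (1.0 - cache_hit_rate)
    # Assumes reads are striped evenly across all drives.
    aggregate_bw = drives * gb_per_s_per_drive * 1e9
    return aggregate_bw / bytes_from_ssd


if __name__ == "__main__":
    # Hypothetical: 40B active params at 4-bit, four drives at an assumed 7 GB/s each.
    cold = streamed_tokens_per_second(40e9, 4, drives=4, gb_per_s_per_drive=7.0)
    warm = streamed_tokens_per_second(40e9, 4, drives=4, gb_per_s_per_drive=7.0,
                                      cache_hit_rate=0.5)
    print(f"no RAM cache:  ~{cold:.1f} tok/s")
    print(f"50% RAM hits:  ~{warm:.1f} tok/s")
```

Under these assumed numbers the bound is on the order of one to a few tokens per second, which is why the number of PCIe lanes (and thus how many drives can be attached at full speed) is the figure to watch.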

atemerev 1 day ago

The best you could do is connect two Mac Studio M3 Ultras, with 512GB of RAM each, over Thunderbolt. Then, theoretically, you can run frontier Chinese models (but not DeepSeek V4 Pro yet). That would be about $20k.

But good luck finding them. Apple discontinued the model a few months ago, and more recently even the 256GB model was discontinued. Big AI really, really does not want people to get off their needle.

zozbot234 1 day ago

DeepSeek V4 Pro is ~800GB total at native quantization (1.6T params with most being 4-bit) so it can run on the hardware you mentioned. There is also a 2-bit version that will run on a single 512GB machine. SSD streaming also makes lower-end hardware viable to at least test the model, if not quite run it usefully.
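The arithmetic behind those figures is easy to check. This is a minimal sketch that treats every weight as stored at one uniform bit width (the comment says "most" are 4-bit, so the real total will differ somewhat) and counts weights only, ignoring KV cache, activations, and OS overhead. The function names are made up for illustration.

```python
def weights_gb(params, bits_per_weight):
    """Size of the weights in GB (10^9 bytes), assuming a uniform bit width."""
    return params * bits_per_weight / 8 / 1e9


def fits(params, bits_per_weight, machines, ram_gb_per_machine):
    """True if the weights alone fit in the combined RAM of the machines."""
    return weights_gb(params, bits_per_weight) <= machines * ram_gb_per_machine


if __name__ == "__main__":
    PARAMS = 1.6e12  # parameter count quoted in the comment above

    print(f"4-bit: {weights_gb(PARAMS, 4):.0f} GB")
    print(f"2-bit: {weights_gb(PARAMS, 2):.0f} GB")
    print("4-bit on 2 x 512 GB:", fits(PARAMS, 4, machines=2, ram_gb_per_machine=512))
    print("4-bit on 1 x 512 GB:", fits(PARAMS, 4, machines=1, ram_gb_per_machine=512))
    print("2-bit on 1 x 512 GB:", fits(PARAMS, 2, machines=1, ram_gb_per_machine=512))
```

At 4 bits the weights come to 800 GB, which leaves roughly 224 GB across the two machines for everything else; at 2 bits they come to 400 GB, which leaves about 112 GB on a single 512 GB machine.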