
Comment by KronisLV

10 days ago

> As of May 2026, how much money do I need to spend to buy hardware to have a local model that is 80% as good as SOTA services for assisting me in writing code?

https://llm-stats.com/benchmarks/swe-bench-verified

SOTA (public proprietary models) would be Opus 4.7 at 0.876

80% of that would be around 0.7.
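
If you want to sanity-check those cutoffs, the arithmetic is trivial; a quick Python sketch using the scores from the link above:

  # Back-of-envelope: what fraction of the SOTA score a model reaches.
  sota = 0.876  # Opus 4.7 on SWE-bench Verified

  for name, score in [("DeepSeek-V4-Pro-Max", 0.806), ("Qwen3.6-35B-A3B", 0.734)]:
      print(f"{name}: {score / sota:.0%} of SOTA")  # 92% and 84% respectively

  print(f"80% threshold: {0.8 * sota:.3f}")  # 0.701
  print(f"90% threshold: {0.9 * sota:.3f}")  # 0.788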

These models qualify, and are upwards of 90% as good in benchmarks:

  DeepSeek-V4-Pro-Max - 1.6T (HuggingFace shows 862B, huh) - 0.806
  Kimi K2.6 - 1.1T - 0.802
  MiniMax M2.5 - 229B - 0.802
  DeepSeek-V4-Flash-Max - 284B (HuggingFace shows 158B as well) - 0.790

These are 80-90% as good, which is also where you see the smaller ones:

  GLM-5 - 754B - 0.778
  Qwen3.6-27B - 27B - 0.772
  Kimi K2.5 - 1.1T - 0.768
  Qwen3.5-397B-A17B - 397B - 0.764
  Step-3.5-Flash - 199B - 0.744
  GLM-4.7 - 358B - 0.738
  MiMo-V2-Flash - 310B - 0.734
  Qwen3.6-35B-A3B - 35B - 0.734
  DeepSeek-V3.2 - 685B - 0.731
  DeepSeek-V3.2-Speciale - 685B - 0.731
  DeepSeek-V3.2 (Thinking) - 685B - 0.731
  Qwen3.5-27B - 27B - 0.724
  Qwen3.5-122B-A10B - 125B - 0.720
  Kimi K2-Thinking-0905 - 1T - 0.713
  LongCat-Flash-Thinking-2601 - 562B - 0.700

Out of those, the most modest one you could get is Qwen3.6-35B-A3B: only about 3B parameters are active per token (that's the A3B part), so the MoE architecture makes it fast across more varied hardware.
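
The rough intuition, as a sketch: memory-bandwidth-bound decoding only has to stream the active parameters per token, not the full model. The L4 bandwidth figure is from the spec sheet; this ignores attention/KV reads and other overhead, so treat it as a ceiling:

  # Rough decode-speed ceiling for a MoE model on a given card.
  bandwidth_gb_s = 300    # Nvidia L4 memory bandwidth, ~300 GB/s
  active_params = 3e9     # the "A3B" part: ~3B active parameters per token
  bytes_per_param = 1     # 8-bit quantization

  tokens_per_s = bandwidth_gb_s * 1e9 / (active_params * bytes_per_param)
  print(f"~{tokens_per_s:.0f} tokens/s ceiling")  # ~100 tokens/s before overhead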

I currently run the Unsloth 8-bit quants on-prem (on a bunch of Nvidia L4 GPUs, chosen for the low TDP, long story). Some people swear by more heavily quantized versions, but with the small models the quality impact is felt more: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF

So essentially you need up to 39 GB for the model itself, plus some for the KV cache and whatever context size you want. Ideally I'd aim for 64 GB of memory; if really pressed for resources, you could fit a heavily quantized version within 32 GB (but with very little memory left for context, which kinda sucks).
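
If you want to estimate the budget yourself, here's a rough sketch; the layer/head counts below are placeholder values, not the model's real config, so read the actual ones from the GGUF metadata:

  # Rough VRAM budget: quantized weights + KV cache for the context you want.
  total_params = 35e9     # full parameter count (all experts live in memory)
  weight_bits = 8         # Q8 quant
  weights_gb = total_params * weight_bits / 8 / 1e9  # ~35 GB, plus GGUF overhead

  # KV cache: 2 (K and V) * layers * kv_heads * head_dim * bytes * context length.
  # Placeholder config values - check the GGUF metadata for the real ones.
  n_layers, n_kv_heads, head_dim = 48, 8, 128
  kv_bytes = 2            # fp16 cache
  ctx_len = 32_768
  kv_gb = 2 * n_layers * n_kv_heads * head_dim * kv_bytes * ctx_len / 1e9  # ~6.4 GB

  print(f"weights ~{weights_gb:.0f} GB, KV cache ~{kv_gb:.1f} GB")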

Personally, I think you need about 45-60 tokens/second for decent usability. Even comparatively modest hardware (including those L4s) can run the model, though on the lower-end options you won't be running parallel sub-agents and the like.

Some random results for when you don't want a traditional multi-GPU setup:

  Mac Mini - about 1999 USD, gets you somewhere upwards of 30 tokens/second (depends on quantization and how you run it)
  Framework Desktop - about 2500 USD, gets you somewhere upwards of 25 tokens/second https://community.frame.work/t/framework-desktop-for-local-ai/80880/5
  DGX Spark - about 3500 USD, gets you somewhere upwards of 50 tokens/second https://forums.developer.nvidia.com/t/qwen-qwen3-6-35b-a3b-and-fp8-has-landed/366822/27

Some approximate results from pulling up random shops and benchmarks, for dual-GPU setups (not necessarily NVLink etc.); a launch sketch follows the list:

  2x Intel Arc Pro B70 - about 1900 USD, gets you around 36 tokens/second, borderline usable, I blame their software stack
  2x Radeon AI PRO R9700 - about 3000 USD, gets you somewhere upwards of 60 tokens/second, usable
  2x Radeon PRO W7800 - about 5400 USD, same as above
  2x NVIDIA RTX 5090 - about 7600 USD, same as above
  2x NVIDIA RTX 5000 Ada - about 9200 USD, same as above
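
For the dual-GPU route, splitting the model across cards is straightforward with llama-cpp-python; a minimal sketch (the filename is my guess at the Q8_0 file in the Unsloth repo linked above, and the even split assumes matching cards):

  from llama_cpp import Llama

  # Split the quantized model across two GPUs and offload all layers.
  llm = Llama(
      model_path="Qwen3.6-35B-A3B-Q8_0.gguf",
      n_gpu_layers=-1,          # offload every layer to the GPUs
      tensor_split=[0.5, 0.5],  # even split; weight towards the bigger card if uneven
      n_ctx=32768,              # context length - this is where the spare VRAM goes
  )

  out = llm("Write a function that parses RFC 3339 timestamps.", max_tokens=256)
  print(out["choices"][0]["text"])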

Of course, for these models some of those cards are overkill, but you can definitely get something for running local models without too many compromises. That said, at that 80% mark you will get a noticeably worse experience than the SOTA cloud models and will often have to rework things quite a bit, as my own experience with the Qwen model shows - okay for simple tasks, breaks down on complex stuff. For that, you'd want at least some of the 90% category models, and would need to consider how much memory you can realistically get.

At least it's not hopeless!