
Comment by jckahn

4 days ago

Alternatively, just use a local model with zero restrictions.

This is currently negative expected value over the lifetime of any hardware you can buy today at a reasonable price, which is basically a monster Mac (or several), at least until Apple folds and raises prices due to RAM shortages.

This requires hardware in the tens of thousands of dollars (if we want the tokens spit out at a reasonable pace).

Maybe in 3-5 years this will work on consumer hardware at speed, but not in the immediate term.

  • $2000 will get you 30-50 tokens/s at perfectly usable quantization levels (Q4-Q5) from any of the top five open-weights MoE models. That's not half bad and will only get better!

    • Only if you are running lightweight models like DeepSeek 32B; anything bigger and it'll drop. Also, RAM and AI-adjacent hardware costs have risen a lot in the last month. It's definitely not $2k for a rig that does 50 tokens a second.

    • Could you explain how? I can't seem to figure it out.

      DeepSeek-V3.2-Exp has 37B active parameters, GLM-4.7 and Kimi K2 have 32B active parameters.

      Let's say we are dealing with Q4_K_S quantization for roughly half the size; we still need to move about 16 GB of weights 30 times per second, which requires roughly 480 GB/s of memory bandwidth (see the sketch below), or maybe half that if speculative decoding works really well.

      Anything GPU-based won't hit that speed, because PCIe 5.0 x16 provides only about 64 GB/s, and $2000 cannot buy enough VRAM (~256 GB) to hold the full model.

      That leaves CPU-based systems with high memory bandwidth. DDR5 would work (around 300 GB/s from 8 channels of DDR5-4800), but that would cost roughly twice that $2000 budget for the RAM alone, disregarding the rest of the system.

      Can you get enough memory bandwidth out of DDR4 somehow?
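      A rough back-of-envelope sketch of that arithmetic in Python, for anyone who wants to plug in their own numbers. The 4 bits/weight figure (which gives the ~16 GB per token above) and the 8-channel DDR5 configuration are this thread's assumptions, not measurements:

          # Bandwidth demand: every active weight is read once per generated token.
          def required_gb_s(active_params_billions, bits_per_weight, tokens_per_s):
              bytes_per_token = active_params_billions * 1e9 * bits_per_weight / 8
              return bytes_per_token * tokens_per_s / 1e9

          # ~32B active params at ~4 bits/weight -> ~16 GB of weights touched per token
          print(required_gb_s(32, 4, 30))    # ~480 GB/s needed for 30 tokens/s

          # Supply side: 8 channels of DDR5-4800 moving 8 bytes per transfer per channel
          print(8 * 4.8e9 * 8 / 1e9)         # ~307 GB/s

          # For comparison, PCIe 5.0 x16 is ~64 GB/s per direction, which is the
          # ceiling if weights have to stream from system RAM into a GPU each token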