
Comment by jckahn

4 days ago

Alternatively, just use a local model with zero restrictions.

This is currently negative expected value over the lifetime of any hardware you can buy today at a reasonable price, which is basically a monster Mac (or several), at least until Apple folds and raises prices due to RAM shortages.

This requires hardware in the tens of thousands of dollars (if we want the tokens spit out at a reasonable pace).

Maybe in 3-5 years this will work on consumer hardware at speed, but not in the immediate term.

  • $2000 will get you 30-50 tokens/s at perfectly usable quantization levels (Q4-Q5) from any of the top five open-weights MoE models. That's not half bad and will only get better!

    • Only if you are running lightweight models like DeepSeek 32B; anything bigger and it'll drop. Also, RAM and AI-adjacent hardware costs have risen a lot in the last month. It's definitely not $2k for a rig that does 50 tokens a second.

    • Could you explain how? I can't seem to figure it out.

      DeepSeek-V3.2-Exp has 37B active parameters, GLM-4.7 and Kimi K2 have 32B active parameters.

      Let's say we are dealing with Q4_K_S quantization for roughly half the size; we still need to move about 16 GB of weights 30 times per second, which requires roughly 480 GB/s of memory bandwidth (see the sketch below), or maybe half that if speculative decoding works really well.

      Anything GPU-based won't hit that speed, because PCIe 5.0 x16 provides only about 64 GB/s, and $2000 cannot buy enough VRAM (~256 GB) to hold the full model.

      That leaves CPU-based systems with high memory bandwidth. DDR5 would work (around 300 GB/s from 8 channels of DDR5-4800), but that would cost roughly twice that $2000 budget for the RAM alone, disregarding the rest of the system.

      Can you get enough memory bandwidth out of DDR4 somehow?
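      A rough back-of-envelope sketch of that arithmetic in Python, for anyone who wants to plug in their own numbers. The 4 bits/weight figure (which gives the ~16 GB per token above) and the 8-channel DDR5 configuration are this thread's assumptions, not measurements:

          # Bandwidth demand: every active weight is read once per generated token.
          def required_gb_s(active_params_billions, bits_per_weight, tokens_per_s):
              bytes_per_token = active_params_billions * 1e9 * bits_per_weight / 8
              return bytes_per_token * tokens_per_s / 1e9

          # ~32B active params at ~4 bits/weight -> ~16 GB of weights touched per token
          print(required_gb_s(32, 4, 30))    # ~480 GB/s needed for 30 tokens/s

          # Supply side: 8 channels of DDR5-4800 moving 8 bytes per transfer per channel
          print(8 * 4.8e9 * 8 / 1e9)         # ~307 GB/s

          # For comparison, PCIe 5.0 x16 is ~64 GB/s per direction, which is the
          # ceiling if weights have to stream from system RAM into a GPU each token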