Comment by Philpax

5 months ago

Eh, to some extent - there's still a pretty significant cost to actually running inference for those models. For example, no consumer can run DeepSeek V3/R1 - that takes tens, possibly hundreds, of thousands of dollars' worth of hardware.

There's still room for other models, especially if they have different performance characteristics that make them suitable to run under consumer constraints. Mistral has been doing quite well here.

If you don't need to pay for the model development costs, I think the cost of running inference will just be driven down to the underlying cloud computing costs. The actual requirement to run DeepSeek V3/R1 passably at home (~4-bit quantization) is really just having 512GB or so of RAM - at roughly 4 bits per parameter, the ~671B-parameter model comes to about 335GB of weights before context and overhead. I bought a used dual-socket Xeon with 768GB of RAM for $2k, and it runs DeepSeek R1 at 1-1.5 tokens/sec, which is perfectly usable for "ask a complicated question, come back an hour or so later, and check on the result".
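
For anyone curious what that looks like in practice, here's a minimal sketch using llama.cpp's Python bindings (llama-cpp-python), assuming you've downloaded a ~4-bit GGUF quantization of R1 - the filename, context size, and thread count below are placeholders you'd adjust for your own machine:

```python
from llama_cpp import Llama

# Placeholder filename - point this at whatever ~4-bit GGUF quantization
# of DeepSeek R1 you've downloaded.
llm = Llama(
    model_path="DeepSeek-R1-Q4_K_M.gguf",
    n_ctx=8192,    # context window; larger contexts need more RAM
    n_threads=64,  # roughly match the physical core count across both sockets
)

# CPU-only generation is slow (~1-1.5 tokens/sec on hardware like the above),
# so ask the long question, walk away, and check back later.
out = llm(
    "Explain the trade-offs between different quantization levels for "
    "large-model CPU inference.",
    max_tokens=2048,
)
print(out["choices"][0]["text"])
```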