Comment by omnicognate
6 hours ago
> claim to have been trained for single-digit millions of dollars
Weren't these smaller models trained by distillation from larger ones, which therefore have to exist in order to do it? Are there examples of near state of the art foundation models being trained from scratch in low millions of dollars? (This is a genuine question, not arguing. I'm not knowledgeable in this area.)
The DeepSeek-V3 paper claims the model was trained from scratch for ~$5.5M: https://arxiv.org/pdf/2412.19437
Kimi K2 Thinking was reportedly trained for $4.6m: https://www.cnbc.com/2025/11/06/alibaba-backed-moonshot-rele...
Both of those were frontier models at the time of their release.
Another interesting number here is Claude 3.7 Sonnet, which many people (myself included) considered the best model for several months after its release, and which was apparently trained for "a few tens of millions of dollars": https://www.oneusefulthing.org/p/a-new-generation-of-ais-cla...