Comment by omnicognate
6 hours ago
> claim to have been trained for single-digit millions of dollars
Weren't these smaller models trained by distillation from larger ones, which therefore have to exist in order to do it? Are there examples of near state of the art foundation models being trained from scratch in low millions of dollars? (This is a genuine question, not arguing. I'm not knowledgeable in this area.)
The DeepSeek-V3 paper claims the model was trained from scratch for ~$5.5M: https://arxiv.org/pdf/2412.19437
Kimi K2 Thinking was reportedly trained for $4.6m: https://www.cnbc.com/2025/11/06/alibaba-backed-moonshot-rele...
Both of those were frontier models at the time of their release.
Another interesting number here is Claude 3.7 Sonnet, which many people (myself included) considered the best model for several months after its release, and which was apparently trained for "a few tens of millions of dollars": https://www.oneusefulthing.org/p/a-new-generation-of-ais-cla...