Comment by khazhoux

2 months ago

I have a bunch of questions, would love for anyone to explain these basics:

* The $5M DeepSeek-R1 (and now this cheap $6 R1) are both based on very expensive oracles (if we believe DeepSeek-R1 queried OpenAI's model). If these are improvements on existing models, why is this being reported as decimating training costs? Isn't fine-tuning already a cheap way to optimize? (maybe not as effective, but still)

* The R1 paper talks about improving one simple game - Countdown. But the original models are "magic" because they can solve a nearly uncountable number of problems and scenarios. How does the DeepSeek / R1 approach scale to the same gigantic scale?

* Phrased another way, my understanding is that these techniques are using existing models as black-box oracles. If so, how many millions/billions/trillions of queries must be probed to replicate and improve the original dataset?

* Is anything known about the training datasets used by DeepSeek? OpenAI used presumably every scraped dataset they could get their hands on. Did DS do the same?

If what you say is true, and distilling LLMs is easy and cheap while pushing the SOTA without a better model to rely on is dang hard and expensive, then the economics of LLM development might not be attractive to investors. Spending billions only to have your competitors come out with products that are 99% as good, and that cost them pennies to train, does not sound like a good business strategy.

  • What I still don’t understand is how one slurps out an entire model (closed source) though.

    Does the deepseek paper actually say what model it’s trained off of, or do they claim the entire thing is from scratch?

    • AFAIK DeepSeek have not publicly acknowledged training their model on OpenAI output - the OpenAI people have alleged that they did.

At any rate, I don't think distillation involves 'slurping out' the whole model. As I understand it, it means using the other model's output as training data for your new model. Maybe analogous to an expert teaching a novice how to do something by providing carefully selected examples, without having to expose the novice to all the blind alleys the expert went down to achieve mastery.
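      The "expert teaching a novice" idea can be sketched in miniature. This is a toy illustration, not DeepSeek's actual method: the teacher here is just a fixed logistic function standing in for a black-box model we can only query, and the student is fit purely on the teacher's outputs, never its internals.

      ```python
      import math
      import random

      def teacher(x):
          # Black-box "expert": we can query it, but can't see its weights.
          # (Hypothetical stand-in for an API-only model.)
          return 1.0 / (1.0 + math.exp(-(3.0 * x - 1.0)))

      # Step 1: probe the teacher to build a (input, soft output) dataset.
      random.seed(0)
      dataset = [(x, teacher(x)) for x in
                 (random.uniform(-2.0, 2.0) for _ in range(200))]

      # Step 2: train a "novice" of the same form by gradient descent on the
      # teacher's soft outputs (cross-entropy gradient for a logistic model).
      w, b, lr = 0.0, 0.0, 0.5
      for _ in range(2000):
          gw = gb = 0.0
          for x, y in dataset:
              p = 1.0 / (1.0 + math.exp(-(w * x + b)))
              gw += (p - y) * x
              gb += (p - y)
          w -= lr * gw / len(dataset)
          b -= lr * gb / len(dataset)

      # The student recovers the teacher's behavior from queries alone.
      print(w, b)
      ```

      The point of the sketch: the student never needed the teacher's "weights", only a dataset of its answers, which is why the scale question above (how many queries does a real replication need?) is the interesting one.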

> If these are improvements on existing models, why is this being reported as decimating training costs?

Because that's what gets the clicks...

Saying they spent a boatload of money on the initial training + iteration + final fine-tuning isn't as headline-grabbing as "$5 million trained AI beats the pants off the 'mericans".