
Comment by ACCount37

10 hours ago

It's not about how big your dataset is - it's about how you use it.

I jest, but I'm also completely serious. 1T tokens from Claude can teach a model something 1T tokens scraped from the open web can't. Things like "how an LLM can problem solve effectively", or "how an LLM should use tools", or "how to construct reasoning chains", or "when to double check", or "what innate capabilities an LLM can or can't rely on".

Those are valuable things that Anthropic's own team spent a lot of effort post-training into Claude. Distillation allows them to be extracted and transferred to an otherwise unremarkable base model.

An unremarkable base model will remain an unremarkable fine-tuned model that memorised a couple thousand of input-output pairings.

  • Ha ha, as if.

    Base models have a lot of capabilities - arranged in all the wrong ways for high performance reasoning and problem-solving. The power of fine tuning on "a couple thousand of input-output pairings" is that it can fix some of that. If your pairings are very well chosen, that is.

  • If that were the case, Anthropic wouldn't be throwing a fit over distillation "attacks".

    • Why? They often don't make sense. They send DMCA takedowns over materials they can't even copyright, for example. They fessed up to creating shadow libraries that they didn't even use in their training corpus, resulting in the largest copyright settlement ever. Your reasoning is flawed.

  • Yes, neural networks are famously poor at generalising.

    • They are poor at generalising from a small number of examples; this is why the real generalisation power is achieved in pre-training.

Can you back this up with hard data and evidence?

Most research converges to the idea that RL on synthetic data makes models worse, not better.

If what you claim were anywhere near that relevant, then we would've long since achieved the singularity by simply feeding increasingly better output into the training of the next model in a loop. Yet this doesn't work.

25 million turns of Claude output is a small amount, yet an expensive one (we're talking hundreds of millions of dollars) that is better spent on compute.

There's no evidence such a process works, but I'd like to know more if I'm wrong.

  • > Most research converges to the idea that RL on synthetic data makes models worse, not better.

    You are missing a mountain of nuance by generalizing the existence of a hole there.

  • Back up what? That distilling from a more capable model into a less capable model pulls the student model's capabilities up? What. Why the fuck is this even a question.

    Look up literally any distillation work. Because this is just distillation, but on one-hot token chains instead of richer logit KL proxies.

    And no, I'm not claiming that you can "close the loop" and get RSI on the cheap just by distilling forever. I'm claiming that distillation is a very cheap way to bring the performance of a less capable model closer to that of a more capable model. It doesn't give you "a more capable model" out of thin air.

    Which is why Chinese labs rely on Anthropic to provide that "more capable model" to them. They take the capabilities Anthropic trained for the hard way, and train for them the easy way.

    It's a "fast follower"/"improved capability density" trick, not a "singularity tomorrow" trick. There are a few "distillation pump" tricks that get closer to what you have in mind, but they're still more about "extract more training signal out of the same set of data" than about "unbounded RSI".
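
A minimal sketch of the contrast drawn above between training on sampled output tokens (one-hot targets) and training on the teacher's full next-token distribution (a KL loss over logits). The logits and function names here are made up for illustration and are not taken from any lab's pipeline. It checks a standard identity: the hard-label cross-entropy, averaged over tokens sampled from the teacher, equals the KL term plus the teacher's entropy. Since the entropy does not depend on the student, both losses pull the student toward the same distribution, and the sampled version is a noisier estimate of that signal.

```python
import math

def softmax(logits):
    """Turn raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q): the 'soft' distillation loss against teacher distribution p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """Shannon entropy of p in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def hard_label_loss(token_id, q):
    """Cross-entropy of student q against a single sampled (one-hot) token."""
    return -math.log(q[token_id])

def expected_hard_label_loss(p, q):
    """Average hard-label loss when tokens are sampled from the teacher p."""
    return sum(pi * hard_label_loss(i, q) for i, pi in enumerate(p))

# Hypothetical next-token logits over a 4-token vocabulary (illustrative only).
teacher_logits = [2.0, 1.0, 0.1, -1.0]
student_logits = [0.5, 0.5, 0.5, 0.5]

p = softmax(teacher_logits)
q = softmax(student_logits)

soft_loss = kl_divergence(p, q)
avg_hard_loss = expected_hard_label_loss(p, q)
teacher_entropy = entropy(p)

# The two losses differ only by a term that is constant in the student.
print(abs(avg_hard_loss - (soft_loss + teacher_entropy)) < 1e-9)
```

With API access only, the sampled tokens are all that is visible, so the hard-label form is the one available; the identity says it still targets the teacher's distribution, just with less signal per token.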

    • So, the way LLMs work in the first place: training on original research that was acquired the hard way.

    • Okay, you have no data, evidence, or paper backing this claim; it's just speculation.

      You want to sell me the idea they are spending hundreds of millions to get unchecked Q/As with reasoning redacted and without checks on the output quality to do what exactly?

      Have a shallow pointless bunch of expensive data to get slightly better RL? It's expensive and pointless.

      Data has shown again and again that synthetic input/output does not benefit models in RL, it may even make the output worse.

      Also, you have a giant bias.

      The Chinese are the only ones releasing models and research papers in the open, from which American labs benefit 24/7 (DeepSeek has been copied by all US providers).

      And you want to sell me this ridiculous idea of the giant return of spending hundreds of millions on unredacted pointless QAs?
