
Comment by ACCount37

10 hours ago

Back up what? That distilling from a more capable model into a less capable model pulls the student model's capabilities up? What. Why the fuck is this even a question.

Look up literally any distillation work. This is just distillation, but on one-hot token chains instead of richer logit KL proxies.
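To make the one-hot vs. logit distinction concrete, here is a minimal sketch of the two per-token losses. The function names and the toy logits are made up for illustration, not taken from any lab's pipeline: the hard-label loss only sees the token the teacher emitted (all an API response exposes), while the soft-label loss sees the teacher's whole next-token distribution.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def hard_label_loss(student_logits, teacher_token):
    """Cross-entropy against a one-hot target: the token the teacher emitted."""
    probs = softmax(student_logits)
    return -math.log(probs[teacher_token])

def soft_label_loss(student_logits, teacher_logits):
    """KL(teacher || student) over the full next-token distribution."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy 4-token vocabulary (hypothetical numbers).
teacher_logits = [2.0, 1.0, 0.1, -1.0]
student_logits = [0.5, 0.4, 0.3, 0.2]
teacher_token = 0  # the single sampled token an API consumer would see

print("hard-label loss:", hard_label_loss(student_logits, teacher_token))
print("soft-label loss:", soft_label_loss(student_logits, teacher_logits))
```

Both losses pull the student toward the teacher; the hard-label version just gets less signal per token, which is why you need a lot of sampled text to make up for it.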

And no, I'm not claiming that you can "close the loop" and get RSI on the cheap just by distilling forever. I'm claiming that distillation is a very cheap way to bring the performance of a less capable model closer to that of a more capable model. It doesn't give you "a more capable model" out of thin air.

Which is why Chinese labs rely on Anthropic to provide that "more capable model" to them. They take the capabilities Anthropic trained for the hard way, and train for them the easy way.

It's a "fast follower"/"improved capability density" trick, not a "singularity tomorrow" trick. There are a few "distillation pump" tricks that get closer to what you have in mind, but they're still more about "extract more training signal out of the same set of data" than about "unbounded RSI".

So, the way LLMs work in the first place: training on original research that was acquired the hard way.

Okay, you have no data, no evidence, and no paper backing this claim. It's just speculation.

You want to sell me the idea they are spending hundreds of millions to get unchecked Q/As with reasoning redacted and without checks on the output quality to do what exactly?

Have a shallow pointless bunch of expensive data to get slightly better RL? It's expensive and pointless.

Data has shown again and again that synthetic input/output does not benefit models in RL, it may even make the output worse.

Also, you have a giant bias.

The Chinese labs are the only ones releasing models and research papers in the open, from which American labs benefit 24/7 (DeepSeek has been copied by all US providers).

And you want to sell me this ridiculous idea of the giant return of spending hundreds of millions on unredacted pointless QAs?

  • What the fuck. Are you a literal, honest to god distillation denier? Straight up "wake up sheeple, model distillation isn't real"?

    I've seen plenty of things in the dumpsters of AI discourse, but this has got to be among the most baffling.

    Yes, there are "giant returns" on distilling from a more capable model into a less capable model. And even more so when the more capable model was trained for something you want and lack. Like: better coding performance.

    Someone like OpenAI had to RLVR for it the hard way (and if you think "distillation is expensive", wait till you hear how many bits per rollout hardcore RLVR gets you), but you get to peek into the results of their work and copy them for yourself.

    Also, Anthropic didn't redact model reasoning until Mythos. OpenAI started with o1, but Claude had reasoning chains accessible for a long time. Which is why Anthropic was more targeted than OpenAI.

    • So we're meant to believe that only US companies have the intelligence and/or access to manpower to generate their own reasoning data? Does China have a population deficit? Maybe China has too high wages to pay people to generate reasoning data?

      The US companies bootstrapped themselves from one model generation to the next, partly by using the previous generation to generate synthetic data, and partly by paying people to hand-generate training data for them. Why do you apparently assume that the Chinese can't do the exact same thing?!

      Surely "coding performance" is by far the easiest thing to generate your own RLVR data for, since it has trivial verifiable rewards: does the code compile, and does it do what you want?
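
For what "trivial verifiable reward" could look like in practice, here is a rough sketch. The `code_reward` helper and the `add` test cases are hypothetical, and a real setup would run candidates in a sandbox with timeouts rather than calling `exec` in-process.

```python
def code_reward(source, func_name, cases):
    """Return 1.0 if `source` compiles and passes every test case, else 0.0.

    Illustrative only: this runs the candidate in the current process,
    which is unsafe for untrusted model output.
    """
    try:
        code = compile(source, "<candidate>", "exec")  # "does the code compile"
    except SyntaxError:
        return 0.0
    namespace = {}
    try:
        exec(code, namespace)
        func = namespace[func_name]
        for args, expected in cases:  # "and do what you want"
            if func(*args) != expected:
                return 0.0
    except Exception:
        # Missing function, runtime error, etc. all count as failure.
        return 0.0
    return 1.0

cases = [((1, 2), 3), ((-1, 1), 0)]
good = "def add(a, b):\n    return a + b\n"
wrong = "def add(a, b):\n    return a - b\n"
broken = "def add(a, b) return a + b\n"

print(code_reward(good, "add", cases))    # 1.0
print(code_reward(wrong, "add", cases))   # 0.0
print(code_reward(broken, "add", cases))  # 0.0
```

Note the reward is a single pass/fail number per attempt, so each rollout carries very little signal on its own compared to training on a full transcript.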
