Comment by HarHarVeryFunny
10 hours ago
So we're meant to believe that only US companies have the intelligence and/or access to manpower to generate their own reasoning data? Does China have a population deficit? Maybe wages in China are too high to pay people to generate reasoning data?
The US companies bootstrapped themselves from one model generation to the next, partly by using the previous generation to generate synthetic data, etc., and partly by paying people to hand-generate training data for them. Why do you apparently assume that the Chinese can't do the exact same thing?!
Surely "coding performance" is by far the easiest thing to generate your own RLVR data for, since it has trivial verifiable rewards - does the code compile and do what you want?
RLVR is the poster child for model distillation. Why? Have you considered just how many tokens a model has to generate before you can check "does the code compile and do what you want"?
You generate 90000 tokens worth of rollout and get a verifiable reward once. RLVR is fucking expensive! It's worth it, because it often unlocks capability advances that other things don't. But it's still fucking expensive. RLVR eats compute like nothing else.
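To put rough numbers on that, here is a back-of-the-envelope sketch using the 90000-token rollout figure above. The function names are made up for illustration, and the comparison assumes distillation is done as plain supervised fine-tuning on teacher outputs, where every teacher token serves as a training target. A scalar reward and a token-level target are not the same kind of signal, so treat this as a picture of signal density, not a precise cost model.

```python
def rewards_per_million_tokens(rollout_tokens: int) -> float:
    """RLVR: one verifiable reward arrives per completed rollout."""
    return 1_000_000 / rollout_tokens


def targets_per_million_tokens() -> int:
    """Supervised distillation (assumed setup): every teacher token is a target."""
    return 1_000_000


# 90000 tokens per rollout is the figure from the comment above.
rlvr_signals = rewards_per_million_tokens(90_000)
sft_signals = targets_per_million_tokens()

print(f"RLVR: ~{rlvr_signals:.1f} reward signals per 1M generated tokens")
print(f"Distillation: {sft_signals:,} token-level targets per 1M tokens")
```

Roughly eleven pass/fail signals per million generated tokens, versus a training target on every single token.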
So, if someone used a lot of RLVR to improve a capability? Just distill from that "someone" and get a similar improvement for a fraction of the price! Then you can do your own RLVR from THAT cheap starting point, if you want to.
"Human domain experts" is a similar niche. Let's say hypothetical "EconomicsAI" hired some $200 per hour human economists to make training data for their "EconGPT" AI. What's cheaper - hiring your own $200 per hour economists, or using a bunch of "$10 per 1M tokens" outputs of EconGPT to bring your own model in line with what EconGPT can do?
Even synthetics can be expensive, because while synthetic tokens themselves are relatively cheap, the applied AI knowledge one needs to make high quality synthetics that improve task performance and don't backfire on you isn't. Again: distillation bypasses a lot of that by cribbing from the outputs of a model that someone has already done that work for. That lets you get more oomph for less, and spend your R&D effort elsewhere.
Your training cost argument makes no sense. It doesn't matter whether you are using human-written code or someone else's LLM-generated code to train on - you are going to be RL training on it, so your RL training cost is the same.
There is a data cost argument, especially if you are paying for human-generated data, although I'm not sure how applicable that is to coding.