Comment by throw83288
5 months ago
I think what's missing is that the amount of training data needed to RL effectively tends to decrease over time. AlphaGo needed initial data from strong human games of Go to bootstrap before recursively improving via RL. Less than two years later, AlphaZero needed no human data at all to recursively improve.
This is what I mean by generalization skill. Right now you need trillions of lines of code to RL a model into a good SWE, but as models grow more capable you will probably need less and less. Eventually we may hit the point where a single large corporation's internal data, in any department, is enough to RL a model into competence, and then the data bottleneck frankly stops mattering for any field once individual conglomerates can start the flywheel on their own.
This isn't an absurdity: a human can "RL" themselves into competence on a single semester of material, a laughably small amount of training data compared to what an LLM needs.
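To make the flywheel concrete, here's a toy sketch in Python: a policy over ten actions is first nudged by a handful of seed demonstrations (the AlphaGo phase), then improves purely from its own sampled rewards via REINFORCE (the AlphaZero phase). The bandit task, the reward shape, and the learning rates are all made-up illustrations, not anyone's actual training recipe.

```python
# Toy sketch of the seed-then-RL "flywheel": bootstrap a policy from a
# tiny set of demonstrations, then let pure RL take over. Everything
# here (task, reward, hyperparameters) is an illustrative assumption.
import math
import random

random.seed(0)

N_ACTIONS = 10
BEST = 7                        # hidden optimal action

def reward(action: int) -> float:
    """Environment: 1.0 for the optimal action, partial credit nearby."""
    return max(0.0, 1.0 - 0.3 * abs(action - BEST))

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sample(probs):
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

logits = [0.0] * N_ACTIONS

# Phase 1: imitation on a tiny seed set (the AlphaGo-style human games).
seed_demos = [7, 7, 6, 8]       # a handful of near-expert demonstrations
for a in seed_demos:
    probs = softmax(logits)
    for i in range(N_ACTIONS):  # cross-entropy-style nudge toward demo
        grad = (1.0 if i == a else 0.0) - probs[i]
        logits[i] += 1.0 * grad

# Phase 2: RL self-improvement, no demonstrations (AlphaZero-style).
baseline = 0.0
for step in range(2000):
    probs = softmax(logits)
    a = sample(probs)
    r = reward(a)
    baseline += 0.01 * (r - baseline)    # moving-average baseline
    for i in range(N_ACTIONS):           # REINFORCE update
        grad = (1.0 if i == a else 0.0) - probs[i]
        logits[i] += 0.1 * (r - baseline) * grad

probs = softmax(logits)
print("P(best action) after flywheel:", round(probs[BEST], 3))
```

The point the analogy is meant to carry: phase 1 shrinks as the starting policy gets more capable, and in the limit (AlphaZero) it disappears entirely.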