Comment by rmunn
1 day ago
AlphaZero playing games against itself was useful because there's an objective measure of success in a game of Go: at the end of the game, did I have more points than my opponent? So you can "reward" the moves that do well, and "punish" the moves that do poorly. And that objective measure of success can be programmed into the self-training algorithm, so that it doesn't need human input in order to tell (correctly!) whether its model is improving or getting worse. Which means you can let it run in a self-feedback loop for long enough and it will get very good at winning.
What's the objective measure of success that can be programmed into the LLM to self-train without human input? (Narrowing our focus to only code for this question). Is it code that runs? Code that runs without bugs? Code without security holes? And most importantly, how can you write an automated system to verify that? I don't buy that E2E project simulations would work: it can simulate the results, but what results is it looking for? How will it decide? It's the evaluation, not the simulation, that's the inescapably hard part.
Because there's no good, objective way for the LLM to evaluate the results of its training in the case of code, self-training would not work nearly as well as it did for AlphaZero, which could objectively measure its own success.
No comments yet
Contribute on Hacker News ↗