Comment by highfrequency
5 days ago
The model’s distribution will certainly change from the base model’s output distribution during reinforcement learning, shifting toward outputs that score well on an external evaluation. This is very different from mode-seeking. Am I missing something?
Mode-seeking describes the way in which the distribution changes. RL is capable of picking out slightly lower probability trajectories and moving them toward the top of the distribution. However, exploration is fundamentally limited by the base policy itself. If a trajectory has near-zero probability under the original model, RLVR is unlikely to discover it because it must first be sampled before it can be rewarded. External search/planning methods such as MCTS or evolutionary search are useful precisely because they can explore candidate trajectories beyond what the policy would ordinarily generate. This is also not just theoretical: GRPO-style methods have been shown to mostly improve `maj@k` and `pass@1` evals while doing much less for `pass@k`, especially at high k, meaning they are mostly sharpening the top of the distribution.
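A rough way to see the `pass@1` versus `pass@k` point, assuming samples are independent and each problem has a fixed per-sample success probability. This is a toy sketch: the two problems and their success rates are made up for illustration and are not taken from any paper.

```python
def pass_at_k(p: float, k: int) -> float:
    """Chance that at least one of k independent samples succeeds,
    given a per-sample success probability p."""
    return 1.0 - (1.0 - p) ** k


# Hypothetical per-sample success rates before and after RL.
# Problem A: already reachable by the base model, so RL can sharpen it.
# Problem B: essentially never sampled, so RL never sees a reward for it.
base = {"A": 0.05, "B": 0.0}
tuned = {"A": 0.80, "B": 0.0}


def mean_pass_at_k(rates: dict, k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(p, k) for p in rates.values()) / len(rates)


for k in (1, 8, 64, 256):
    print(f"k={k:>3}  base={mean_pass_at_k(base, k):.3f}  "
          f"tuned={mean_pass_at_k(tuned, k):.3f}")
```

In this toy setup the tuned model is far ahead at k=1, but the gap closes as k grows, because problem B stays at zero for both models no matter how many samples are drawn.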
I'm not saying this makes it useless - it clearly helps for math and coding tasks. But the ceiling exists, and that's what the original tweet was referring to. AlphaEvolve also shows what lies beyond the ceiling, although their planner was rudimentary.
Sure, but I'd say that moving desirable trajectories from very low probability to high probability is characteristic of genuine human learning and discovery. Technically, quantum gravity, a bestselling novel, or a yet undiscovered proof of the Riemann Hypothesis is "in my distribution", but when we are talking about a long chain of unlikely token completions (with multiplicative probabilities), whether that trajectory lives in the tail of the distribution vs. in the mode makes all the difference.
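To make the "multiplicative probabilities" point concrete, here is a small sketch under a strong simplifying assumption: every token in the trajectory has the same conditional probability. The per-token values and the length of 200 tokens are hypothetical, picked only to show how fast the product collapses.

```python
import math


def trajectory_log_prob(token_probs):
    """Log-probability of a whole trajectory: the sum of per-token log-probs,
    i.e. the log of the product of per-token probabilities."""
    return sum(math.log(p) for p in token_probs)


def expected_samples_to_hit(log_prob: float) -> float:
    """Mean number of independent samples before the trajectory shows up once
    (mean of a geometric distribution, 1 / p)."""
    return math.exp(-log_prob)


n_tokens = 200  # hypothetical trajectory length

# Same length, slightly different per-token probability.
near_mode = trajectory_log_prob([0.99] * n_tokens)
in_tail = trajectory_log_prob([0.90] * n_tokens)

print(f"near mode: ~{expected_samples_to_hit(near_mode):.1f} samples")
print(f"in tail:   ~{expected_samples_to_hit(in_tail):.2e} samples")
```

A modest drop in per-token probability turns a trajectory that shows up within a handful of samples into one that needs on the order of a billion, which is the gap between "sharpenable by RL" and "never sampled in practice".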
Would you agree that it is a matter of degree rather than a qualitative distinction? There seems to be a broad misconception in Sutton and others that output quality cannot exceed that of the base internet distribution; my point is that RL allows you to easily produce an output distribution that is better than whatever data you trained on according to some evaluation criteria. There are no clear theoretical limits on how much better it can get; rather, there are many people asserting guesses that there is an upper bound and it lives below "human creativity." I just haven't seen any solid theoretical argument, and the empirical evidence has so far shown continual improvement.
Also, I would be keen to look at any sources you have of pass@k not improving much during GRPO.
I said slightly lower, and I meant it. It's virtually impossible to sample a trajectory that is really, really low probability (say, by smoothing the distribution before sampling) without incurring crazy amounts of noise. And only once you sample it can you reward it and do the update.
Again, no one is saying models can't improve beyond the internet, i.e. the data distribution! They clearly can. The claim is that RL without real exploration cannot exceed the base model's distribution, which by virtue of SGD _does_ generalize.
And also, it doesn't mean it's not useful. Improving sample efficiency and making something that happens 1 in 15 times happen 1 in 1.2 times is insanely useful and is what has enabled the kind of coding agents we have today.
Sutton, especially, I doubt has a misconception about this :)
> pass@k
Yeah, AFK now. But it's a well-researched thing. You can look for more, but here's one off the top of my head: https://openreview.net/forum?id=4OsgYD7em5 The original DeepSeek paper also had the result, i.e. the paper that first got famous for using GRPO as a method that works for LLMs. A side result in one of these papers, I forget which one, is that the base model converges in performance with the RL'd one at high k.
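For anyone checking such results themselves, `pass@k` is usually estimated from n samples per problem with the combinatorial estimator popularized by code-generation evals, 1 - C(n-c, k) / C(n, k), where c is the number of correct samples. A minimal sketch follows; whether the papers mentioned above use exactly this estimator is an assumption, and the sample counts are hypothetical.

```python
from math import comb


def pass_at_k_estimate(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n samples of which c are correct:
    1 minus the chance that a random size-k subset contains no correct sample."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts for one problem: 100 samples drawn from each model.
n = 100
base_correct, tuned_correct = 2, 60

for k in (1, 16, 64):
    print(f"k={k:>2}  base={pass_at_k_estimate(n, base_correct, k):.3f}  "
          f"tuned={pass_at_k_estimate(n, tuned_correct, k):.3f}")
```

With these made-up counts the tuned model dominates at k=1, while at k=64 the base model has caught up most of the way, which is the shape of the convergence described above.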