Comment by porridgeraisin
4 days ago
Pass@128 is a lot. They were not easy.
> Discovery / creativity
I'm absolutely uninterested in the semantic discussions of what is a real discovery, what is creativity, what is intelligence, etc. I simply don't care. If it's useful, great, use it. If it's not, great, don't.
> How small p can be
All that depends on your sampling procedure. If you intentionally smooth the distribution out you can sample the smallest thing, but you pay for it with noise. Taken to an extreme, this is the monkeys typing on the keyboard argument.
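To make the smoothing tradeoff concrete, here is a minimal sketch using temperature scaling as the smoothing knob. That choice is an assumption for illustration (there are other ways to flatten a distribution), and the three logits are made up rather than taken from any real model.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; a higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy logits: index 0 is the mode, index 2 is the rare outcome.
logits = [8.0, 4.0, 0.0]

sharp = softmax_with_temperature(logits, 1.0)   # unsmoothed sampling distribution
smooth = softmax_with_temperature(logits, 4.0)  # intentionally smoothed

print("rare outcome, T=1:", sharp[2], "| T=4:", smooth[2])
print("mode,         T=1:", sharp[0], "| T=4:", smooth[0])
```

The rare outcome becomes easier to hit at the higher temperature, but only because mass is taken away from the mode and spread everywhere else. In the limit of very high temperature this is uniform sampling, i.e. the monkeys.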
It's a mathematical fact that RL cannot improve things it doesn't sample. In any learned distribution you pay a heavy cost by sampling far away from the mode. Most RL algos sample rollouts, maybe with some smoothing, but that's it. This is why external planners are necessary in order to sample something effectively un-sampleable in the base distribution. Simple example: tool use!
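A toy illustration of the "cannot improve what it doesn't sample" point, assuming a plain REINFORCE estimator with no baseline on a 3-armed bandit with a softmax policy. The logits, the reward function, and the hand-picked batches are all hypothetical; real RL setups differ in many details.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_gradient(logits, sampled_actions, reward_fn):
    """Monte Carlo policy-gradient estimate w.r.t. the logits (no baseline).

    For a softmax policy, d log pi(a) / d logit_i = 1[i == a] - pi(i).
    """
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a in sampled_actions:
        r = reward_fn(a)
        for i in range(len(logits)):
            indicator = 1.0 if i == a else 0.0
            grad[i] += r * (indicator - probs[i])
    return [g / len(sampled_actions) for g in grad]

def reward(action):
    # Sparse reward: only the rare action 2 pays off.
    return 1.0 if action == 2 else 0.0

logits = [5.0, 3.0, -10.0]  # action 2 is far from the mode

# Batches are fixed by hand (instead of sampled) to keep the sketch deterministic.
grad_missed = reinforce_gradient(logits, [0, 0, 1, 0], reward)  # action 2 never sampled
grad_hit = reinforce_gradient(logits, [0, 0, 2, 0], reward)     # action 2 sampled once

print("gradient when the rare action is missed:", grad_missed)
print("gradient when the rare action is hit:   ", grad_hit)
```

When the rewarded action never shows up in the batch, every reward is zero and so is the gradient estimate, so the policy gets no push toward it. The update only moves probability toward action 2 once something (smoothing, a planner, a tool) gets it sampled in the first place.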
Sutton and everyone are simply calling for a focus on improving these external planners in the same way, as they also enable much better "continual" learning and so on.
> Erdos solution
The RL is probably what made such a huge trajectory efficiently sampleable within our lifetimes. You can do many useful things like this, and more, purely with the base model distribution.
In fact, doing RL on user chats and so on, especially from pair-coding sessions, is improving these models' coding abilities by a lot, making them even more reliable for SWE. In this regard, mode-seeking is a win.
> All sequences are technically in distribution
If it were truly improving 1-in-a-million things systemically, then you wouldn't see the base model getting the same results given many samples, albeit those are not Erdos problems.
Could it be that at 1T scale, and for difficult problems specifically, GRPO somehow filters through the noise and picks out the 1 in a trillion? Extremely unlikely (you have your expected rollouts required to sample that, and then you have your sparse reward signal and no credit assignment on top of that...). But of course, only 2 companies in the world can do experiments with it, so there could be some unknown effect the rest of the world has not seen. Barring that, no.
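For a rough sense of the "expected rollouts" part, here is a back-of-the-envelope sketch. It assumes every rollout is an independent draw with the same success probability p, which is a simplification, and the function names are mine.

```python
import math

def pass_at_k(p, k):
    """Probability that at least one of k independent samples succeeds: 1 - (1 - p)^k."""
    # expm1/log1p keep precision when p is tiny.
    return -math.expm1(k * math.log1p(-p))

def expected_rollouts(p):
    """Mean number of independent samples until the first success (geometric distribution)."""
    return 1.0 / p

for p in (1e-6, 1e-12):
    print(f"p={p:g}: pass@128={pass_at_k(p, 128):.3g}, "
          f"expected rollouts={expected_rollouts(p):.3g}")
```

At 1 in a trillion, 128 samples per prompt almost never contain a success, and that is before the sparse reward and the missing credit assignment come into it.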