Comment by mediaman
1 day ago
RLVR (RL with verifiable rewards) is not offline learning. It's not learning from a static dataset. The rollouts are live: they are sampled from the current policy, verified, and the weights are updated at each pass based on feedback from the environment.
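A minimal sketch of what that online loop looks like (assuming PyTorch and a plain REINFORCE-style update; the toy policy, the verifier, and all names here are made up for illustration, not any particular lab's recipe):

    import torch

    # Toy "policy": a categorical distribution over a tiny vocabulary of digits.
    # In real RLVR this would be the LLM; here it is just a learnable logit vector.
    VOCAB = 10          # tokens are the digits 0..9
    SEQ_LEN = 4
    logits = torch.zeros(VOCAB, requires_grad=True)
    opt = torch.optim.Adam([logits], lr=0.1)

    def verify(tokens):
        # Verifiable reward: 1.0 if the generated digits sum to an even number.
        return 1.0 if sum(tokens) % 2 == 0 else 0.0

    for step in range(200):
        # Live rollout: sample from the *current* policy, not a static dataset.
        dist = torch.distributions.Categorical(logits=logits)
        tokens = dist.sample((SEQ_LEN,))
        reward = verify(tokens.tolist())          # feedback from the environment
        log_prob = dist.log_prob(tokens).sum()
        loss = -(reward - 0.5) * log_prob         # REINFORCE with a constant baseline
        opt.zero_grad()
        loss.backward()
        opt.step()                                # weights updated on this pass

The point of the sketch is only the shape of the loop: generate, verify, update, repeat on fresh samples.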
You might argue that traditional RL involves multiple states the agent moves through. But autoregressive LLMs are the same: each forward pass that generates a token also changes the state, because the context the model conditions on grows by that token.
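Put as a (purely illustrative) sketch, with the state taken to be the growing token prefix, decoding already has the structure of an MDP; toy_policy here is a stand-in, not a real model:

    import random

    def toy_policy(state):
        # Stand-in for the LLM's next-token distribution: a random digit token.
        return random.randint(0, 9)

    def decode(policy, prompt_tokens, max_new_tokens):
        # state = token prefix; action = next token; transition appends the token.
        state = list(prompt_tokens)
        for _ in range(max_new_tokens):
            action = policy(state)        # one forward pass -> one token (action)
            state = state + [action]      # ...and that same pass changes the state
        return state

    print(decode(toy_policy, [1, 2, 3], 5))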
After training, the weights are fixed, of course, but that is the case for most traditional RL systems too. RL does not intrinsically mean continually updating weights during deployment, which carries its own set of problems.
From the premise that RLVR can be used to benchmaxx (true!), it does not follow that it is only good for that.