
Comment by porridgeraisin

5 days ago

RLVR still does not expand beyond the base distribution though, it only mode-seeks within it.

I.e., evaluation and retention: yes. Variation or "planning": no.

That is not to say you cannot use LLMs. AlphaEvolve does exactly that, though it uses a simple external evolutionary planner. The overarching point he's making is that our planner is still "dumb" and we need to work on it.

When you iteratively guide an LLM in Claude Code, you are the external planner. That also works.

> RLVR still does not expand beyond the base distribution though, it only mode-seeks within it.

Seems clearly false. Pretraining finds the mean/mode of the data distribution. RL can easily generate many samples around that mode, evaluate them against an external source of truth (e.g., compile the code and run it), and then selectively train on the good samples. This clearly can go beyond the initial data distribution.
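To make the sample, verify, train-on-the-good-ones loop concrete, here is a toy sketch over four possible outputs. It is not how any real RLVR pipeline is implemented: the probabilities, the `passes` verifier results, and the `filtered_update` helper are all made up for illustration, and the update is the idealized limit of refitting to the kept samples exactly.

```python
def filtered_update(probs, passes):
    """One idealized round: sample from probs, keep verified samples, refit to them.

    In expectation, the refit distribution is the old one restricted to
    passing outputs and renormalized.
    """
    kept = [p if ok else 0.0 for p, ok in zip(probs, passes)]
    total = sum(kept)
    if total == 0.0:
        # No passing output is ever sampled, so there is nothing to train on.
        return list(probs)
    return [k / total for k in kept]


# Hypothetical base-model distribution over four candidate outputs.
base = [0.70, 0.25, 0.05, 0.0]
# Hypothetical verifier: output 0 (the base mode) fails, the rest pass.
passes = [False, True, True, True]

after = filtered_update(base, passes)
print([round(p, 3) for p in after])
```

In this toy setting the mass moves off the base model's mode (output 0) and onto the verified outputs 1 and 2, so the distribution clearly changes. Output 3 also passes the verifier, but it is never sampled because the base model gives it zero probability, so it stays at zero. A real softmax never assigns exactly zero, so treat that last part as a limiting case rather than a claim about actual models.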

  • by base distribution, I meant the base model's output distribution

    • The model’s distribution will certainly change from the base model’s output distribution during reinforcement learning, shifting toward outputs that score well on an external evaluation. This is very different from mode-seeking. Am I missing something?
