Comment by highfrequency

9 days ago

Unless I'm missing something, this argument seems to apply only to the original pretraining era (e.g., GPT-1 through GPT-4). The post-training and reinforcement learning paradigms are clearly doing variation, evaluation, and selective retention, no?

RLVR still does not expand beyond the base distribution though, it only mode-seeks within it.

I.e., evaluation and retention, yes; variation or "planning", no.

That is not to say you cannot use LLMs. AlphaEvolve does exactly that, though it relies on a simple external evolutionary planner. The overarching point he's making is that our planner is still "dumb" and we need to work on it.

When you iteratively guide an LLM in Claude Code, you are the external planner. That also works.

  • > RLVR still does not expand beyond the base distribution though, it only mode-seeks within it.

    Seems clearly false. Pretraining finds the mean/mode of the data distribution. RL can easily generate many samples around that mode, evaluate them against an external source of truth (e.g., compile the code and run it), and then selectively train on the good samples. This clearly can go beyond the initial data distribution.
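For anyone who wants the sample / verify / retain loop from the comment above in concrete form, here is a toy sketch. Everything in it is an assumption made for illustration: the "policy" is just a table of probabilities over ten candidate answers, `verifier` stands in for an external check such as compiling and running code, and the update rule simply mixes the policy toward the verified samples. It is not how any production RLVR pipeline works. Because the policy is tabular, it can only reweight answers it could already sample, so it illustrates the loop itself and does not settle whether a real model with shared parameters ends up outside its base distribution.

```python
import random


def verifier(answer: int) -> bool:
    """Hypothetical external source of truth: is `answer` the positive root of 49?"""
    return answer > 0 and answer * answer == 49


def sample(policy: dict[int, float], k: int, rng: random.Random) -> list[int]:
    """Variation: draw k candidate answers from the current policy."""
    answers = list(policy)
    weights = [policy[a] for a in answers]
    return rng.choices(answers, weights=weights, k=k)


def retain_and_update(policy: dict[int, float], samples: list[int], lr: float = 0.5) -> dict[int, float]:
    """Evaluation + selective retention: keep verified samples, move the policy toward them."""
    good = [s for s in samples if verifier(s)]
    if not good:
        # No verified sample this round, so there is nothing to learn from.
        return dict(policy)
    counts = {a: 0 for a in policy}
    for s in good:
        counts[s] += 1
    n = len(good)
    # Mix the old policy with the empirical distribution of verified samples.
    return {a: (1 - lr) * p + lr * counts[a] / n for a, p in policy.items()}


def run(rounds: int = 5, k: int = 64, seed: int = 0):
    rng = random.Random(seed)
    policy = {a: 0.1 for a in range(1, 11)}  # toy "base model": uniform over 1..10
    history = [policy[7]]  # probability assigned to the correct answer, per round
    for _ in range(rounds):
        policy = retain_and_update(policy, sample(policy, k, rng))
        history.append(policy[7])
    return policy, history


if __name__ == "__main__":
    final_policy, history = run()
    print("P(correct) per round:", [round(p, 3) for p in history])
```

The probability mass on the verified answer never decreases from round to round, which is the "selective retention" part. Whether the "variation" part counts as real exploration is exactly what the two comments above disagree about.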

The transcript does seem to overlook post-training steps like Reinforcement Learning with Verifiable Rewards (RLVR) (though I certainly won't claim that Rich Sutton is unaware of such things; RLVR has a very narrow set of evaluation approaches).

I wonder if this is a precursor to Keen Tech leaning into David Silver's Ineffable Intelligence approach.

  • This was exactly what I was thinking of. RLVR is the secret sauce behind o3 and its many successors.

    It's the secret sauce behind why the current models are so great at coding and soon to be unbeatable at math.

    LLMs can pose many questions and, if the answers are easily verifiable, be fine-tuned very heavily on them. A lot of the world-models discussion will inevitably lean into simulations as verification.

    • I'll admit that I miss having access to the ChatGPT 4.5 "absolutely gigantic model" with enough tuning to make it sane and useful. The RLVR models are superb for actual tasks in those RLVR domains, but that fine-tuned view of the world as a verifiable problem to solve makes them feel worse for touchy-feely stuff. Even for medical consultation and diagnosis, an RLVR model's urge to reach a conclusion is often a liability.
