Comment by musebox35
5 days ago
The most successful applications like coding are not the result of pure LLM/generative modeling. They come from closing the loop with an agentic harness. The generate-test-selectively refine loop is the core modality of scientific work. An LLM + RL with Verifiable Rewards + feedback from compiler/terminal runs mimics this process to a great extent.
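To make the loop concrete, here is a minimal sketch under toy assumptions. Everything in it is a hypothetical stand-in: `propose` plays the role of the LLM, `run_tests` plays the role of the compiler/terminal, and the task (finding a hidden integer from "too low" / "too high" feedback) is chosen only because its reward is trivially verifiable.

```python
from typing import Callable, Optional, Tuple

# Hypothetical stand-ins: `propose` plays the generator (the LLM) and
# `run_tests` plays the verifier (compiler / terminal run). Neither is a real API.
Feedback = Tuple[bool, str]


def generate_test_refine(
    propose: Callable[[Optional[int], Optional[str]], int],
    run_tests: Callable[[int], Feedback],
    max_iters: int = 50,
) -> Tuple[Optional[int], int]:
    """Return (accepted candidate or None, number of iterations used)."""
    candidate: Optional[int] = None
    message: Optional[str] = None
    for step in range(1, max_iters + 1):
        candidate = propose(candidate, message)  # generate
        passed, message = run_tests(candidate)   # test
        if passed:
            return candidate, step
        # Otherwise loop: the feedback in `message` drives the next proposal.
    return None, max_iters


def make_toy_task(hidden: int, low: int = 0, high: int = 1000):
    """Toy task: find `hidden` using only 'too low' / 'too high' feedback."""
    bounds = {"low": low, "high": high}

    def run_tests(x: int) -> Feedback:
        if x == hidden:
            return True, "ok"
        return False, ("too low" if x < hidden else "too high")

    def propose(prev: Optional[int], message: Optional[str]) -> int:
        # Selective refinement: shrink the search interval using the feedback.
        if message == "too low":
            bounds["low"] = prev + 1
        elif message == "too high":
            bounds["high"] = prev - 1
        return (bounds["low"] + bounds["high"]) // 2

    return propose, run_tests


propose, run_tests = make_toy_task(hidden=737)
result, steps = generate_test_refine(propose, run_tests)
print(result, steps)
```

The point of the sketch is only the shape: the generator never sees the answer, just the verifier's feedback, and the harness decides when to stop.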
This is the Fisher/Box feedback loop (https://www-sop.inria.fr/members/Ian.Jermyn/philosophy/writi...) implemented on a modern computational system. The LLM is just one component. I wish Sutton had commented on this fuller picture of what we have now instead of commenting just on the LLM/Backprop side of things. I am honestly curious whether such a loop can at least partially automate discovery.
There are more elements to discovery though. It is still not clear where the initial working model/hypothesis comes from or how the updates are selected (unless it is just parameter induction). I recently read about Hanson's Patterns of Discovery, which aims in that direction. I have not read it yet, but I am curious if it has any mechanistic clues.
> There are more elements to discovery though. It is still not clear where the initial working model/hypothesis comes from or how the updates are selected
That is a problem in RL, so we usually do supervised training first, teach it to imitate some trajectories, then do RL to refine the model. RL alone has a huge problem because it might be hard to reach a reward, hence hard to learn the task by pure reinforcement. Humans also combine supervision (learn from books) with search (solving problems) to break the discovery problem. For example, a human with no initial instruction in math would not produce great results no matter how smart they are. The bootstrap was exploration paid for in the past.
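A back-of-the-envelope sketch of why the imitation step matters, under toy assumptions that are mine and not from any paper: the task is to emit one specific sequence of 20 binary choices, reward arrives only on an exact match, and supervised training is modeled crudely as raising the per-step chance of the right choice from 0.5 to 0.9.

```python
def success_probability(per_step_correct: float, n_steps: int) -> float:
    """Chance that one sampled trajectory earns a reward given only on an exact match."""
    return per_step_correct ** n_steps


def expected_rollouts(per_step_correct: float, n_steps: int) -> float:
    """Expected independent rollouts until the first reward (mean of a geometric distribution)."""
    return 1.0 / success_probability(per_step_correct, n_steps)


N_STEPS = 20  # toy trajectory length (assumption)
uniform = expected_rollouts(0.5, N_STEPS)          # no prior: coin flip at every step
after_imitation = expected_rollouts(0.9, N_STEPS)  # crude model of an SFT prior

print(f"uniform policy:  ~{uniform:,.0f} rollouts per reward")
print(f"imitation prior: ~{after_imitation:.1f} rollouts per reward")
```

In this toy setting the uniform policy needs about a million rollouts per reward and the imitation-initialized one fewer than ten, which is the sense in which pure RL "might be hard to reach a reward".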
The SFT + RL connection to model/hypothesis search is insightful. Brute-force, scalable search is also where Sutton's Bitter Lesson points. Once your search domain is small compared to your search budget, that makes a lot of sense.
If I get your meaning right, SFT creates the right inductive bias so that the RL search + reward guidance does the trick.
For novel discovery, the question might then be whether the inductive bias builds a strong enough prison so no new discovery is possible by RL or if the search can escape the boundaries set by SFT given enough randomization and the right reward function.
I know that RL is usually not performed at inference time, but in-context learning mechanisms might be developed by RL to discover at test time. Edit: I would love to hear if that actually happens or not, like new induction heads (https://transformer-circuits.pub/2022/in-context-learning-an...) forming during RL. I really have no idea.
The role of evolution is always a confounding factor as well, and the various analogies for how it maps onto AI research are never quite satisfactory.
Completely agree on the importance of the harness.
The problem I see is the same problem Evolutionary Algorithms had: you can generate potential solutions until you run out of cash, but you still need to evaluate those solutions. You need a fitness function, and that means you need to at least know the general shape of the solution. If anyone knows of any work towards more open-ended fitness functions, I'd love to read it.
Just some speculation, but I think humans have, on the one hand, a lot of degrees of freedom in the behaviors and thoughts available to them, while at the same time all that freedom is reined in by our biological needs: preserving the integrity of our bodies, but also the integrity of our minds. This extends further to preserving our surroundings (for our safety, since a changing environment brings uncertainty), the people we care about, and even entire societies. It also covers preserving our future selves through prediction of future environments.
So all that is to say, I'm not sure it is even theoretically possible to create a single algorithm to do open ended search and evaluation. Biology has billions of years of evolution and accumulation, whereas a simple algorithm in a computer, even if smart and connected to the real world, has no such accumulation.
I think humans hit the perfect sweet spot: we have the simplicity of the self-preservation instinct, and we have the complexity of the cortex and the many degrees of freedom that come with it. On top of that, we have a lot of accumulated degrees of freedom in the society, technology, and knowledge that we have, built up over thousands of years. We can't just write an algorithm that encapsulates all of that without going through the actual evolution.
And just to make it explicit: a large percentage of what humans think derives from an instinct to preserve the self, the mind, the future, and the environment, even if it is very abstract at times. Not absolutely all, but I think a good chunk. The complexity and degrees of freedom come from the fact that we have so many neurons in the brain, a complex body with hands and whatever else that allows a lot of behaviors, and a complex environment that is constantly challenging us.
> If anyone knows of any work towards more open-ended fitness functions, I'd love to read it.
There is research in open-ended learning, see "Why Greatness Cannot Be Planned" by Kenneth O. Stanley. The core idea is that in open-ended scenarios you don't know what action was good except in hindsight because your path is deceptive. So the idea is to replace fitness with novelty search which provides more stepping stones towards the goal.
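A minimal sketch of the novelty idea, assuming a common formulation in which novelty is the mean distance to the k nearest neighbors in an archive of previously seen behaviors. The 1-D behavior space, the mutation step, and all names here are illustrative assumptions, not code from the book.

```python
import random
from typing import List


def novelty(behavior: float, archive: List[float], k: int = 3) -> float:
    """Mean distance from `behavior` to its k nearest neighbors in the archive."""
    if not archive:
        return float("inf")  # nothing seen yet, so anything counts as novel
    distances = sorted(abs(behavior - seen) for seen in archive)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)


def novelty_search(generations: int = 30, pop_size: int = 10, seed: int = 0) -> List[float]:
    """Select for being different from the archive rather than for a fitness score."""
    rng = random.Random(seed)
    population = [0.0] * pop_size  # everyone starts with the same behavior
    archive: List[float] = []
    for _ in range(generations):
        # Variation: each individual takes a small random step (toy mutation).
        children = [b + rng.uniform(-1.0, 1.0) for b in population]
        # Evaluation: rank by novelty against the archive, not by any objective.
        children.sort(key=lambda b: novelty(b, archive), reverse=True)
        survivors = children[: pop_size // 2]
        archive.extend(survivors)           # retained as stepping stones
        population = survivors + survivors  # refill the population
    return archive


archive = novelty_search()
print(len(archive), min(archive), max(archive))
```

Note that no objective appears anywhere in the loop. The only selection pressure is distance from what has already been archived.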
The TRM architecture models both the problem and the solution at the same time. You might find it an interesting read.
https://arxiv.org/abs/2510.04871
Seems to a layperson like myself that in Math they're using Lean and in programming contexts they're using compilers, such that the models themselves tend towards embedding that determinism "intuitively".
Yes it seems most anti-LLM researchers take issue with LLMs on fundamental math/architecture based properties, but seem to miss all the engineering going on around the model to make it useful.
Those mathematical shortcomings very well might mean they aren't a path to true AGI, but that seems fairly irrelevant at this point tbh.
> The generate-test-selectively refine loop is the core modality of scientific work.
Can you expand on this? Because when I think of it, it calls to mind p-hacking and selective publishing more than anything.
Like, when you try prompts over and over and over until you get code that finally works. At which point you stop and pronounce the dubious claim, "AI is amazing, look what it does!"
Most importantly, the reinforcement loop is used during training. I don't agree with Sutton's original hypothesis, but it holds even less after reinforcement learning.
RLVR still does not expand beyond the base distribution though, it only mode-seeks within it.
I.e., evaluation and retention, yes; variation or "planning", no.
That is not to say you cannot use LLMs. AlphaEvolve does exactly that. It uses a simple external evolutionary planner. The overarching point he's making is that our planner is still "dumb" and we need to work on it.
When you iteratively guide an LLM in Claude Code, you are the external planner. That also works.
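As a rough illustration of that division of labor, here is a toy outer loop with the generator plugged in from outside. This is not AlphaEvolve's actual algorithm; `evolve`, `flip_one_bit`, `matches`, and the bit-string task are all hypothetical stand-ins, with `mutate` sitting where an LLM call would go.

```python
import random
from typing import Callable, List


def evolve(
    mutate: Callable[[List[int], random.Random], List[int]],
    score: Callable[[List[int]], float],
    seed_candidate: List[int],
    generations: int = 200,
    pop_size: int = 8,
    seed: int = 0,
) -> List[int]:
    """Outer loop: variation from `mutate`, evaluation by `score`, retention of the best."""
    rng = random.Random(seed)
    population = [list(seed_candidate)]
    for _ in range(generations):
        parents = [rng.choice(population) for _ in range(pop_size)]
        children = [mutate(p, rng) for p in parents]
        # Elitism: parents compete with children, so the best score never drops.
        population = sorted(population + children, key=score, reverse=True)[:pop_size]
    return population[0]


# Toy stand-ins (assumptions): the "LLM" flips one random bit, and the
# "evaluator" counts how many bits match a fixed target.
TARGET = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0]


def flip_one_bit(candidate: List[int], rng: random.Random) -> List[int]:
    child = list(candidate)
    i = rng.randrange(len(child))
    child[i] = 1 - child[i]
    return child


def matches(candidate: List[int]) -> float:
    return float(sum(a == b for a, b in zip(candidate, TARGET)))


start = [0] * len(TARGET)
best = evolve(flip_one_bit, matches, start)
print(matches(start), "->", matches(best))
```

The planner here is deliberately "dumb": it only samples parents and keeps the top scorers. Swapping `flip_one_bit` for a smarter generator changes the variation step but leaves the planner unchanged.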