Comment by visarga

5 days ago

> There are more elements to discovery though. It is still not clear where the initial working model/hypothesis comes from or how the updates are selected

That is a problem in RL, so we usually do supervised training first, teach it to imitate some trajectories, then do RL to refine the model. RL alone has a huge problem because it might be hard to reach a reward, hence hard to learn the task by pure reinforcement. Humans also combine supervision (learn from books) with search (solving problems) to break the discovery problem. For example, a human with no initial instruction in math would not produce great results no matter how smart they are. The bootstrap was exploration paid for in the past.
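To make the sparse-reward point concrete, here is a minimal sketch under toy assumptions (not anyone's actual training setup): a task is rewarded only if all 20 binary choices in an episode are correct. The per-step accuracies 0.5 (no prior) and 0.95 (after a hypothetical imitation phase) and the function name `success_rate` are made up for illustration.

```python
import random

def success_rate(p_correct, horizon, episodes, seed=0):
    """Fraction of episodes that earn the sparse reward.

    An episode is rewarded only if every one of `horizon` independent
    binary choices is correct; each choice is right with prob. `p_correct`.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(episodes):
        # all() stops at the first wrong choice, like an episode that fails early
        if all(rng.random() < p_correct for _ in range(horizon)):
            hits += 1
    return hits / episodes

HORIZON = 20
EPISODES = 10_000

# Assumed numbers: a policy with no prior guesses at 0.5 per step;
# a policy after a hypothetical imitation phase is right 0.95 per step.
uniform = success_rate(0.5, HORIZON, EPISODES)
imitated = success_rate(0.95, HORIZON, EPISODES)

# Analytic values for comparison: p ** horizon
uniform_exact = 0.5 ** HORIZON    # about 1e-6
imitated_exact = 0.95 ** HORIZON  # about 0.36

print(f"no prior:   sampled {uniform:.4f}, exact {uniform_exact:.2e}")
print(f"after SFT:  sampled {imitated:.4f}, exact {imitated_exact:.2f}")
```

With no prior, pure exploration almost never sees a reward in 10,000 episodes, so there is essentially nothing for RL to reinforce. The imitation-initialized policy gets rewarded in roughly a third of episodes, which gives RL a usable signal to refine.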

The connection you draw between SFT + RL and model/hypothesis search is insightful. Brute-force, scalable search is also what Sutton's Bitter Lesson points to. Once your search domain is small compared to your search budget, that approach makes a lot of sense.

If I get your meaning right, SFT creates the right inductive bias so that the RL search + reward guidance does the trick.

For novel discovery, the question might then be whether the inductive bias builds a prison strong enough that RL cannot discover anything new, or whether the search can escape the boundaries set by SFT given enough randomization and the right reward function.

I know that RL is usually not performed at inference time, but in-context learning mechanisms might be developed by RL to discover at test time. Edit: I would love to hear if that actually happens or not, like new induction heads (https://transformer-circuits.pub/2022/in-context-learning-an...) forming during RL. I really have no idea.

The role of evolution is always a confounding factor as well, and the various analogies for how it maps onto AI research are never quite satisfactory.