Comment by itissid
3 months ago
> The problem is that RL is extremely inefficient.
Wait, what? That is an odd way of defining it. That's like saying Turing machines are an inefficient way to solve TSP. You would, at the least, want to define this in terms of complexity, or put it into the context of domains and observability.
RL is, by definition, a field about finding efficient solutions to problems in the domain of choice[1]. There are likely regimes in LLM/LRM learning where RL can be quite efficient, even polynomial time in the state space; we just need to explore and find them. For example, you can use dynamic programming as a "more" efficient way to solve MDPs[1], because it is polynomial in the size of the state space × action space.
[1]https://web.stanford.edu/class/psych209/Readings/SuttonBarto...
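To make the dynamic programming point concrete, here is a rough sketch of value iteration on a toy MDP. The chain environment, the `make_chain` helper, and the discount of 0.9 are all made up for illustration, and this assumes the transition model is fully known, which is the DP setting rather than model-free RL. Each sweep visits every (state, action) pair once, so for deterministic transitions the cost per sweep is proportional to |S| × |A|.

```python
from typing import List, Tuple

# Hypothetical encoding: transitions[s][a] is a list of
# (probability, next_state, reward) outcomes. A terminal state has no actions.
Transitions = List[List[List[Tuple[float, int, float]]]]


def q_value(outcomes, values, gamma):
    """Expected return of one action given current value estimates."""
    return sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)


def value_iteration(transitions: Transitions, gamma: float = 0.9,
                    tol: float = 1e-8, max_sweeps: int = 10_000):
    """In-place value iteration; returns (values, greedy policy)."""
    n_states = len(transitions)
    values = [0.0] * n_states
    for _ in range(max_sweeps):
        delta = 0.0
        for s in range(n_states):
            if not transitions[s]:  # terminal state keeps value 0
                continue
            best = max(q_value(o, values, gamma) for o in transitions[s])
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:  # stop once a full sweep changes nothing noticeable
            break
    # Greedy policy w.r.t. the final values; -1 marks terminal states.
    policy = [
        max(range(len(acts)), key=lambda a: q_value(acts[a], values, gamma))
        if acts else -1
        for acts in transitions
    ]
    return values, policy


def make_chain(n: int) -> Transitions:
    """Toy chain: action 0 moves left, action 1 moves right.
    Entering the last state pays reward 1 and ends the episode."""
    transitions: Transitions = []
    for s in range(n):
        if s == n - 1:
            transitions.append([])  # terminal
            continue
        left, right = max(s - 1, 0), s + 1
        transitions.append([
            [(1.0, left, 0.0)],
            [(1.0, right, 1.0 if right == n - 1 else 0.0)],
        ])
    return transitions


if __name__ == "__main__":
    values, policy = value_iteration(make_chain(5))
    print(values, policy)
```

The catch, of course, is that this only works when you can enumerate the states and know the transitions, which is exactly what you don't have for LLM-sized problems.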
RL provides a very poor training signal for deep learning, an order of magnitude or more worse than supervised learning. Better than nothing, of course.
What the OP suggested is similar to training a transformer from scratch using RL (i.e., no training tokens) towards the objective of steering a pretrained LLM to produce human-readable output. It will probably not even converge, and if it does, it would take immense compute.
In the case of supervised problem domains, you implicitly make a decision about what is signal, and what is noise, and sure, in that closed setting, supervised learning is much more sample efficient. But I think what we're learning now is that with strong enough base models, 'aha' moments in RL training show that it might be possible to essentially 'squeeze out signal from language itself', giving you far greater breadth of latent knowledge than supervised examples, and letting you train to generalize to far greater horizons than a fixed dataset might allow. In a fascinating way it is rather reminiscent of, well, abiogenesis. This might sound like speculative claptrap if you look at the things the current generation of models are still weak at, but... there's a real chance that there is a very heavy tail to the set of outcomes in the limit.
With a pretrained LLM most of the work is done. RL just steers the model into a 'thinking' mode. There is enough signal for that to work and for the inefficiency to not matter.
The downside is that you are limiting the model to think in the same language it outputs. An argument could be made that this is not how all humans think. I know that I rarely think in language or even images; concepts (probably not even the right word) just mix and transform, and often I don't even bother to make the transformation to language at the end, just action.