Comment by reqo

1 year ago

> This ‘goal drift’ means that agents, or tasks done in a sequence with iteration, become less reliable. The model ‘forgets’ where to focus, because its attention is neither selective nor dynamic.

I don't know if I agree with this. The attention module is specifically designed to be selective and dynamic; otherwise it would not be much different from a word embedding (look up "soft" weights vs "hard" weights [1]). I also think deep learning should not be confused with deep RL. LLMs are autoregressive models, which means they are trained to predict the next token, and that is all they do. The next token is not necessarily the most reasonable one (this is why datasets are so important for performance). Deep RL models, on the other hand, seem to be excellent at agency and decision making (albeit in restricted environments), because they are trained to do so.

[1] https://en.wikipedia.org/wiki/Attention_(machine_learning)
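Since [1] is a bit dense, here is a minimal NumPy sketch of what "soft" dot-product attention computes (shapes and names are mine, not from any particular library). The point is that the weights over positions are recomputed from the input itself, which is what makes attention selective and dynamic rather than a fixed lookup:

```python
import numpy as np

def soft_attention(Q, K, V):
    # Scores are a function of the current queries and keys, so the
    # weighting over positions changes with the content itself.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)   # row-wise softmax: "soft" weights
    return w @ V                            # input-dependent mixture of values

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                 # 5 tokens, embedding dim 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(soft_attention(x @ Wq, x @ Wk, x @ Wv).shape)  # (5, 8)
```

A static word embedding would assign each token the same vector regardless of context; here, the same token can be mixed differently in every sentence.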

LLMs are fine-tuned with RL too. They are NOT simply next-token predictors. RLHF generates gradients from whole answers at once, so it looks further into the future than a single step. This may not be perfect, but it is clearly more than focusing just one token ahead.
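To make that concrete, here is a toy sketch of the two objectives (all numbers are invented; real RLHF uses a learned reward model and PPO-style updates, this is just the REINFORCE core of the idea):

```python
import numpy as np

# Per-step probabilities the model assigned while producing a 3-token answer
# over a toy 3-token vocabulary (rows = positions, columns = tokens).
step_probs = np.array([[0.6, 0.3, 0.1],
                       [0.2, 0.5, 0.3],
                       [0.1, 0.1, 0.8]])

# Next-token (pretraining) loss: each position is matched against a fixed
# human target independently; no signal about how the whole answer turned out.
targets = [0, 1, 2]
ce_loss = -sum(np.log(step_probs[t, targets[t]]) for t in range(3))

# RLHF-style REINFORCE loss: the model's OWN sampled tokens, all scaled by
# one reward for the finished answer, so credit is assigned sequence-wide.
sampled = [0, 2, 2]              # tokens the model actually generated
reward = 0.9                     # hypothetical reward-model score for the answer
pg_loss = -reward * sum(np.log(step_probs[t, sampled[t]]) for t in range(3))

print(f"cross-entropy: {ce_loss:.3f}  policy gradient: {pg_loss:.3f}")
```

In the first loss every token is judged in isolation; in the second, one judgment of the complete answer flows back into every token that produced it.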

In the future, the RL part of LLM training will grow a lot. Why do I say this? There are two sources for learning: the past and the present. Training on human text uses past data; that is off-policy. Training on interactive data is on-policy. There is nothing we know that doesn't come from the environment, and what is not written in any book must be learned from outside.
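A tiny bandit sketch of that off-/on-policy split (everything here is invented for illustration; real LLM training is vastly bigger, but the shape of the two updates is the point):

```python
import numpy as np

rng = np.random.default_rng(0)
true_reward = np.array([0.2, 0.8])  # the environment; unknown to the learner
logits = np.zeros(2)                # the policy being trained

def probs():
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Off-policy (the past): imitate a fixed log, like pretraining on human text.
for a in [0] * 20:                            # the "book" mostly pulled arm 0
    grad = -probs(); grad[a] += 1.0           # gradient of log pi(a)
    logits += 0.1 * grad                      # pure imitation, no reward

# On-policy (the present): act with the current policy, learn from feedback.
for _ in range(300):
    a = rng.choice(2, p=probs())              # the policy's own action
    r = float(rng.random() < true_reward[a])  # fresh signal from the environment
    grad = -probs(); grad[a] += 1.0
    logits += 0.1 * r * grad                  # REINFORCE, no baseline

print(probs())  # interaction typically flips the preference toward arm 1
```

The log alone teaches the policy to repeat the past; only interaction reveals that the other arm pays better.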

That is why I think supervised pre-training on human text is only half the story, and RL-based agent learning, interactivity in other words, is the next step. The two feet on which intelligence stands are language (past experience) and environment (present experience). We can't get ahead without both of them.

AlphaZero showed what an agent can learn from an environment alone, and LLMs show what they can learn from humans. But the world is big; there are plenty of environments that can provide a learning signal, in other words feedback, to LLMs.