
Comment by visarga

1 year ago

LLMs are fine-tuned with RL too. They are NOT simply next-token predictors. RLHF computes gradients from whole answers at once, so it is looking further into the future. This might not be perfect, but it is clearly more than optimizing just one token ahead.
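
Roughly, in toy form (just a sketch I'm making up here, not anyone's actual RLHF code; the tiny GRU "LM" and reward_fn are stand-ins for a real model and a learned reward model):

    import torch
    import torch.nn as nn

    vocab_size, hidden = 16, 32
    embed = nn.Embedding(vocab_size, hidden)
    lm = nn.GRU(hidden, hidden, batch_first=True)        # stand-in for the LLM
    head = nn.Linear(hidden, vocab_size)
    params = list(embed.parameters()) + list(lm.parameters()) + list(head.parameters())
    opt = torch.optim.Adam(params, lr=1e-3)

    def reward_fn(answer_tokens):
        # stand-in for a learned reward model: ONE scalar for the WHOLE answer
        return float((answer_tokens == 3).float().mean())

    # Supervised pre-training loss: each position only looks one token ahead.
    ref = torch.randint(0, vocab_size, (1, 10))
    h, _ = lm(embed(ref[:, :-1]))
    sup_loss = nn.functional.cross_entropy(head(h).reshape(-1, vocab_size),
                                           ref[:, 1:].reshape(-1))

    # REINFORCE-style RLHF step: sample a whole answer, score it once,
    # and let that single reward scale the gradient of every token in it.
    tok, state, log_probs, sampled = torch.zeros(1, 1, dtype=torch.long), None, [], []
    for _ in range(10):
        h, state = lm(embed(tok), state)
        dist = torch.distributions.Categorical(logits=head(h[:, -1]))
        tok = dist.sample().unsqueeze(0)
        log_probs.append(dist.log_prob(tok.squeeze(0)))
        sampled.append(tok)
    answer = torch.cat(sampled, dim=1)
    R = reward_fn(answer)                        # judged after the fact, as a whole
    rl_loss = -R * torch.stack(log_probs).sum()  # credit flows to all tokens at once

    opt.zero_grad()
    rl_loss.backward()
    opt.step()
    print(float(sup_loss), float(rl_loss))

The point of the toy: sup_loss is assembled from one-step-ahead terms, while rl_loss ties every token's update to how the full answer scored.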

In the future the RL part of LLM training will grow a lot. Why am I saying this? There are two sources of learning: the past and the present. Training on human text uses past data, which is off-policy; training on interactive data is on-policy. There is nothing we know that did not ultimately come from the environment, and what is not written in any book must be learned from outside.
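
To make the off-policy / on-policy distinction concrete (again a made-up toy, every name here is hypothetical):

    import random

    human_log = ["the cat sat", "the dog ran"]     # past: a fixed corpus written by others

    def policy(prompt):                            # stand-in for the current model
        return random.choice(["sat", "ran", "flew"])

    def environment(prompt, action):               # present: reacts to THIS model's action
        return 1.0 if action in prompt else 0.0

    # Off-policy: the training distribution is the logged corpus, frozen in the past;
    # nothing the model does changes what it gets trained on.
    off_policy_batch = random.sample(human_log, k=2)

    # On-policy: the data is whatever the current policy actually does right now,
    # and the learning signal is the environment's feedback on those actions.
    prompt = "the cat sat on the mat"
    action = policy(prompt)
    reward = environment(prompt, action)
    on_policy_batch = [(prompt, action, reward)]
    print(off_policy_batch, on_policy_batch)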

That is why I think supervised pre-training on human text is only half the story, and RL-based agent learning, in other words interactivity, is the next step. The two feet on which intelligence stands are language (past experience) and the environment (present experience). We can't get ahead without both of them.

AlphaZero showed what an agent can learn from an environment alone, and LLMs show what they can learn from humans. But the world is big; there are plenty of environments that can provide a learning signal, in other words feedback, to LLMs.