
Comment by imtringued

1 year ago

I was on board with the article up until the middle. After the conclusion, where the author simply gives up, it felt like it dragged on way too much.

His attempts at training on Conway's Game of Life are kind of pathetic. The problem isn't a lack of training data, nor is it the data's "distribution". The fallacy is that the dataset itself doesn't contain reasoning in the first place. For example, GitHub Copilot has fill-in-the-middle capability, while ChatGPT by default does not.

Now here is the shocker about the fill-in-the-middle capability: how does the LLM learn to do it? In an incredibly primitive way. Instead of building a model that can edit its own context, it receives a marker in the context that tells it where the cursor is, and then it is fine-tuned on the expected response.
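
To make that concrete, here is a minimal sketch of how fill-in-the-middle training data is typically prepared. The sentinel names `<PRE>`, `<SUF>`, `<MID>` are placeholders I'm assuming for illustration; real models use their own special tokens, but the idea is the same: the "cursor" is just a marker in the prompt, and the missing span is moved to the end so ordinary next-token prediction can learn it.

```python
# Sketch of fill-in-the-middle data preparation (sentinel names are made up).
def make_fim_example(prefix: str, middle: str, suffix: str) -> str:
    """Rearrange a document so infilling can be learned with plain
    left-to-right next-token prediction: the cursor position is marked by
    sentinels, and the missing span becomes the target at the end."""
    return f"<PRE>{prefix}<SUF>{suffix}<MID>{middle}<EOT>"

# Original document: "def add(a, b):\n    return a + b\n"
example = make_fim_example(
    prefix="def add(a, b):\n    return ",
    middle="a + b",
    suffix="\n",
)
print(example)
# The model is fine-tuned to predict the tokens after <MID>.
# It never actually edits or inserts into its own context.
```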

This means that an LLM could, in principle, be trained to insert tokens at any position in the context or even replace existing tokens, but here is the problem: once the model has modified its own context, it has left the training dataset. How do you evaluate the intermediate steps, which can consist of genuinely novel thoughts that are required but not present in the data? Adding two numbers requires intermediate states which the model may even know how to produce, but it can never be rewarded for using them if they aren't in the training data, because for the LLM the only goal is to conform to the dataset.
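
A toy illustration of that point: under standard next-token training, the model is scored token-by-token against the dataset continuation, so a correct answer that takes a detour through intermediate states scores as badly as a wrong answer. The scoring function below is a crude stand-in for teacher-forced cross-entropy, purely to show the shape of the problem.

```python
# Made-up stand-in for token-level training loss: 1 point of loss per
# position where the produced token differs from the dataset token.
def toy_loss(target_tokens, produced_tokens):
    length = max(len(target_tokens), len(produced_tokens))
    return sum(
        1 for i in range(length)
        if i >= len(produced_tokens)
        or i >= len(target_tokens)
        or produced_tokens[i] != target_tokens[i]
    )

target = list("579")                           # dataset: "123+456=" is followed by "579"
direct = list("579")                           # zero-shot answer, matches the data
with_steps = list("120+450=570, 3+6=9, 579")   # also correct, but via intermediate states

print(toy_loss(target, direct))      # 0     -> fully rewarded
print(toy_loss(target, with_steps))  # large -> punished despite ending correctly
```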

If you wanted to avoid this, you would need to define a metric that rewards the model for a success even if that success took a detour. Currently, training is inherently built around the idea of zero-shot responses.
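
One way to express the kind of metric I mean: score only the final outcome and ignore how the model got there, instead of scoring every token against the dataset. This is just a sketch of the idea (essentially an outcome-based, RL-style reward), with a hypothetical `outcome_reward` function, not anyone's actual training code.

```python
# Reward the final answer regardless of how many intermediate steps preceded it.
def outcome_reward(model_output: str, correct_answer: str) -> float:
    lines = model_output.strip().splitlines()
    final_part = lines[-1] if lines else ""
    return 1.0 if final_part.strip().endswith(correct_answer) else 0.0

print(outcome_reward("120+450=570\n3+6=9\n570+9=579", "579"))  # 1.0, detour allowed
print(outcome_reward("579", "579"))                            # 1.0, zero-shot also fine
print(outcome_reward("578", "579"))                            # 0.0, wrong answer
```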