Comment by vjerancrnjak

2 days ago

Imagine optimizing/training on a happy path.

When you generate future tokens, you're conditioning on history tokens that are all happy.

So how can a model, given sad tokens, generate future happy tokens if it never learned to do so?
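
To make that concrete: under plain next-token training (teacher forcing), the loss is only ever computed with the gold, happy prefix as context, so the model never practices recovering from its own bad tokens. A rough sketch of what I mean (`model` here is just a placeholder for anything mapping token IDs to per-position logits):

```python
import torch.nn.functional as F

def teacher_forcing_loss(model, tokens):
    # tokens: (batch, seq_len) gold "happy" sequences from the training set
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    logits = model(inputs)  # conditioned only on the gold (happy) history
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```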

The work you're looking for is already here: it's "thinking". I assume they include sad tokens in the dataset and have the model produce "thinking", which should result in happy tokens coming after the thinking tokens. If the thinking is bad (judged by whether happy tokens follow it), it's punished; if it's good, it's reinforced via gradient descent.
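
Roughly what I imagine that punish/reinforce loop looks like, REINFORCE-style (`model.sample` and `reward_fn` are assumed helpers for illustration, not anyone's actual training code):

```python
def thinking_rl_step(model, optimizer, sad_prefix, reward_fn, max_new=64):
    # Start from a "sad" prefix and let the model emit thinking + answer tokens.
    tokens, logprobs = model.sample(sad_prefix, max_new_tokens=max_new)
    # Score only what comes after the thinking: +1 if the tokens are "happy", -1 if not.
    reward = reward_fn(tokens)
    # Bad thinking gets pushed down, good thinking gets pushed up (descent on -reward * log-prob).
    loss = -(reward * logprobs.sum())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```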