Comment by badmonster

2 days ago

Why do LLMs struggle so much with recovering from early wrong turns in multi-turn conversations — even when all prior context is available and tokenized?

Is it due to the model's training distribution (mostly single-shot completions), the way context windows are encoded, or an architectural bottleneck?

Feels like there's no dynamic internal state that evolves over the conversation — only a repeated re-parsing of static history. Has anyone seen work on integrating memory/state mechanisms that allow belief revision within a session, not just regurgitation of past tokens?

We shouldn’t anthropomorphize LLMs—they don’t “struggle.” A better framing is: why is the most likely next token, given the prior context, one that reinforces the earlier wrong turn?

Imagine optimizing/training almost entirely on the happy path: the training dialogues show correct continuations, so nearly every history the model learns to condition on is "happy."

When it generates future tokens at inference time, it expects to be looking at that same kind of happy history.

So how can a model, handed sad tokens (a context that already contains a wrong turn), generate future happy tokens if it never learned to do so?
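To make the distribution-mismatch point concrete, here's a toy sketch (all data and helper names are hypothetical, not any lab's actual pipeline) contrasting a typical "happy" training dialogue with a synthesized "recovery" dialogue, where a wrong turn is deliberately placed in the context and the target continuation corrects it. Only if examples like the second kind exist does the model ever get gradients for "produce happy tokens given a sad history."

```python
# Toy sketch, hypothetical data and helpers: most instruction-tuning dialogues
# look like `happy_example`, so the model rarely conditions on a prefix that
# contains a mistake. One mitigation is to synthesize "recovery" examples where
# an early wrong turn appears in the context and the target corrects it.

happy_example = [
    {"role": "user", "content": "Sort this list: [3, 1, 2]"},
    {"role": "assistant", "content": "[1, 2, 3]"},
]

recovery_example = [
    {"role": "user", "content": "Sort this list: [3, 1, 2]"},
    {"role": "assistant", "content": "[3, 2, 1]"},              # the "sad" turn: a wrong answer left in context
    {"role": "user", "content": "That's descending order."},
    {"role": "assistant", "content": "You're right, ascending order is [1, 2, 3]."},  # training target
]

def to_training_pair(dialogue):
    """Condition on everything before the last assistant turn, predict that turn."""
    context, target = dialogue[:-1], dialogue[-1]["content"]
    return context, target

# The loss only ever teaches "recover from a wrong turn" on pairs like this one.
print(to_training_pair(recovery_example))
```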

The work you're looking for is already here: it's "thinking." I assume they include sad tokens in the dataset, have the model produce "thinking" tokens, and expect happy tokens to come after the thinking. If the thinking is bad (judged by whether happy tokens actually follow), it gets penalized; if it's good, it gets reinforced via gradient descent.
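A minimal sketch of that reward-the-thinking-by-its-outcome idea, assuming a REINFORCE-style update over a toy policy (two hand-made "thinking strategies" and a hard-coded environment stand in for token-level generation and a verifier; none of this is any specific lab's recipe):

```python
# Sample a "thinking" strategy, produce an answer, score only the final answer,
# and push probability toward whichever thinking preceded a correct answer.
import math
import random

random.seed(0)

STRATEGIES = [
    "re-check the earlier turn for mistakes",  # tends to recover from the wrong turn
    "trust the earlier turn as given",         # tends to repeat the error
]

def answer_is_correct(strategy: str) -> bool:
    # Toy "environment": re-checking usually fixes the early wrong turn.
    p_correct = 0.9 if strategy.startswith("re-check") else 0.2
    return random.random() < p_correct

logits = [0.0, 0.0]   # log-linear policy over thinking strategies
lr = 0.5

def probs(logits):
    z = [math.exp(l) for l in logits]
    s = sum(z)
    return [x / s for x in z]

for step in range(200):
    p = probs(logits)
    i = 0 if random.random() < p[0] else 1                  # sample the thinking
    reward = 1.0 if answer_is_correct(STRATEGIES[i]) else 0.0
    baseline = 0.5                                          # crude variance reduction
    # REINFORCE for a softmax policy: d log pi(i) / d logit_j = 1{j==i} - p_j
    for j in range(2):
        grad = (1.0 - p[j]) if j == i else -p[j]
        logits[j] += lr * (reward - baseline) * grad

print(probs(logits))  # probability mass shifts toward the strategy whose answers score well
```

The key design point matches the comment above: the thinking tokens themselves are never labeled good or bad directly; they only get credit or blame through the happy (or sad) tokens that follow them.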