Comment by refulgentis

3 months ago

This is a bug, and a regression, not a feature.

It's odd to see it recast as "you need to give better instructions [because it's different]" -- you could drop the "because it's different" part, and it'd apply to failure modes in all models.

It also begs the question of how it's different, and that's where the rationale gets circular: you have to prompt it differently because it's different, because you have to prompt it differently.

And where that really gets into trouble is the "and that's the point" part -- as the other comment notes, it's expressly against OpenAI's documentation and thus intent.

I'm a yuge AI fan. Models like this are a clear step forward. But it does a disservice to readers to leave the impression that the same techniques don't apply to other models, and recasts a significant issue as design intent.

Looking at o1's behavior, it seems there's a key architectural limitation: while it can see chat history, it doesn't seem able to access its own reasoning steps after outputting them. This is particularly significant because it breaks the computational expressivity that made chain-of-thought prompting work in the first place—the ability to build up complex reasoning through iterative steps.

This will only improve when o1's context windows grow large enough to maintain all its intermediate thinking steps -- we're talking orders of magnitude beyond current limits. Until then, this isn't just a UX quirk; it's a fundamental constraint on the model's ability to develop thoughts over time.

  • > This will only improve when o1's context windows grow large enough to maintain all its intermediate thinking steps -- we're talking orders of magnitude beyond current limits.

    Rather than retaining all those steps, what about just retaining a summary of them? Or putting them in a vector DB, so that on follow-up it can retrieve the subset most relevant to the follow-up question?

    • That’s kind of what RNNs and CNNs did before the "Attention Is All You Need" paper introduced the transformer architecture. One of the breakthroughs that enabled GPT is giving each token equal “weight” through self-attention, instead of letting earlier tokens get attenuated by some sort of summarization mechanism.

  • Is that relevant here? The post discussed writing a long prompt to get a good answer, not issues with, e.g., step #2 forgetting what was done in step #1.

    • https://platform.openai.com/docs/guides/reasoning/advice-on-... explains the bug: o1 can't see its own past thinking. That would seem to limit the expressivity of the chain of thought. Maybe within a single step it's still a UTM, but with that loss of memory, extra steps are needed to make sure the right information is passed forward. The model is likely to forget key ideas it had but didn't write down in its output, which will tend to make it drift: it focuses more on its final statements and less (or not at all) on the things that led it to them.

    • Yes it is, because the post discussed this approach precisely because unrolling the actual chain of thought in interactive chat does not work.

      And it's doubly relevant, because chain of thought lets transformers break out of TC0 complexity and act as a UTM (https://arxiv.org/abs/2310.07923). This matters because TC0 is pattern matching, while a UTM is what you need for general intelligence. Forgetting what the model thought breaks this and (ironically) probably forces the model back into one-shot pattern matching; the practical workaround is to pass the key context forward yourself, as in the sketch below.
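
      To make that workaround concrete -- a minimal sketch, assuming the standard openai Python client; the model name and prompts are placeholders, not anything the docs prescribe:

          from openai import OpenAI

          client = OpenAI()

          def ask(prompt):
              # One-shot call to a reasoning model; nothing hidden carries
              # over between calls, so the prompt must contain everything
              # the model needs.
              resp = client.chat.completions.create(
                  model="o1-preview",  # placeholder model name
                  messages=[{"role": "user", "content": prompt}],
              )
              return resp.choices[0].message.content

          first = ask("Plan a zero-downtime Postgres major-version upgrade. "
                      "List the constraints you assumed.")

          # Follow-up: restate the parts of the first answer that matter,
          # rather than assuming the model remembers how it got there.
          second = ask("Conclusions so far (restated, since you cannot see "
                       "your earlier reasoning):\n" + first +
                       "\n\nFollow-up: adapt the plan for a read-heavy workload.")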

It's different because a chat model has been post-trained for chat, while o1/o3 have been post-trained for reasoning.

Imagine trying to have a conversation with someone who's been told to interpret anything said to them as a problem they need to reason about and solve. I doubt you'd give them high marks for conversational skill.

Ideally one model could do it all, but for now the tech is apparently trained with reinforcement learning to steer responses toward a single goal (gaming human feedback, or successful reasoning).

  • TFA, and my response, are about a de novo relationship between task completion and input prompt. Not conversational skill.

    • Yes, and the "de novo" explanation seems obvious, as indicated: the model was trained differently, with different reinforcement learning goals (reasoning vs. human feedback for chat). The need for different prompting follows from the operational behavior of a model trained this way: self-evaluation against the data present in the prompt, backtracking when it veers away from the goals established in the prompt, and so on -- the handful of reasoning behaviors that have been baked into the model via RL.

I wouldn't be so harsh - you could have a 4o-style LLM turn vague user queries into precise constraints for an o1-style AI - this is how a lot of Stable Diffusion image generators work already.
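
A rough sketch of that two-stage setup -- this assumes the standard openai Python client, and the model names and prompts are placeholders rather than a claim about how any shipping product wires it up:

    from openai import OpenAI

    client = OpenAI()

    def complete(model, prompt):
        # Single-turn helper: send one user message, return the text reply.
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    vague = "make my app faster"

    # Stage 1: a chat-tuned model expands the vague request into explicit
    # constraints, success criteria, and open questions.
    spec = complete(
        "gpt-4o",  # placeholder chat model
        "Rewrite this request as a precise problem statement with "
        "constraints, success criteria, and open questions:\n" + vague,
    )

    # Stage 2: the reasoning model gets the fully specified problem in one
    # shot, front-loading the prompt as discussed upthread.
    answer = complete("o1-preview", spec)  # placeholder reasoning model

Same idea as the prompt-expansion front ends on image generators: the chat model absorbs the conversational slack, and the reasoning model gets a fully specified problem.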