I might be missing something here as a non-expert, but isn’t chain-of-thought essentially asking the model to narrate what it’s “thinking,” and then monitoring that narration?
That feels closer to injecting a self-report step than observing internal reasoning.
Kind of. The narration is an actual part of the thinking process. Just not the only part.
It can reflect the thinking process fully, or it can be full of post hoc justifications. In practice, it's anything in between.
As task complexity increases and chain-of-thought length grows, it becomes load-bearing by necessity. It still doesn't have to be fully accurate, but it must be doing something right, or the answer wouldn't work.
the chain of thought is what it is thinking
When we think, our thoughts are composed of both nonverbal cognitive processes (we have access to their outputs, but generally lack introspective awareness of their inner workings), and verbalised thoughts (whether the “voice in your head” or actually spoken as “thinking out loud”).
Of course, there are no doubt significant differences between whatever LLMs are doing and whatever humans are doing when they “think” - but maybe they aren’t quite as dissimilar as many argue? In both cases, there is a mutual/circular relationship between a verbalised process and a nonverbal one (in the LLM case, the inner representations of the model).
Chain-of-thought is a technical term in LLMs — not literally “what it’s thinking.”
As far as I understand it, it’s a generated narration conditioned by the prompt, not direct access to internal reasoning.
It is text that describes a plausible/likely thought process that conditions future generation by its presence in the context.
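A rough sketch of that mechanism in Python, with a placeholder generate() standing in for whatever completion call is in use (nothing here is any particular vendor's interface, and the question is just an example):

    # The "chain of thought" is just text appended to the context, so every
    # later token is conditioned on it. No internal state is exposed.

    def generate(context: str) -> str:
        """Placeholder for a real LLM completion call."""
        raise NotImplementedError("swap in your model or API of choice")

    question = "A bat and a ball cost $1.10; the bat costs $1.00 more than the ball. How much is the ball?"

    context = f"Q: {question}\nLet's think step by step.\n"
    cot = generate(context)        # the model writes its narration here
    context = context + cot + "\nFinal answer:"
    answer = generate(context)     # conditioned on the narration above, not on hidden activations

Monitoring the chain of thought means reading that cot string; the only sense in which it is "internal" is that it feeds back into the context.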
Wrong to the point of being misleading. This is a goal, not an assumption
Source: all of mechinterp
It is what it is thinking consciously / its internal narrative. If we really want to lean into the analogy between human psychology and LLMs: a supervillain's internal narrative with their plans would go into their CoT notepad. The "internal reasoning" that people keep referencing in this thread, meaning the transformer weights and the inscrutable inner workings of a GPT, isn't reasoning but more like instinct, or the subconscious.
this is not correct
Related: check out chain of draft if you haven't.
Similar performance with about 7% of the tokens of chain of thought.
https://arxiv.org/abs/2502.18600
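Roughly, and paraphrasing the paper's idea rather than quoting its exact prompt, the difference is just the instruction you condition generation on:

    # Chain of thought: full verbose narration before the answer.
    cot_instruction = "Think step by step, then give the final answer."

    # Chain of draft: same stepwise structure, but each step is compressed to a
    # few-word draft, which is where the token savings come from.
    cod_instruction = ("Think step by step, but keep only a minimal draft of each "
                       "step, a few words at most, then give the final answer.")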
That's a comparison to "CoT via prompting of chat models", not "CoT via training reasoning models with RLVR", so it may not apply.
This seems remarkably less safe?
Why would we want to purposely decrease interpretability?
Very strange.
> Our expectation is that combining multiple approaches—a defense-in-depth strategy—can help cover gaps that any single method leaves exposed.
Implement hooks in Codex, then.