Comment by phpnode
9 hours ago
Claude doesn't know why it acted the way it acted, it is only predicting why it acted. I see people falling for this trap all the time
9 hours ago
> Claude doesn't know why it acted the way it acted, it is only predicting why it acted. I see people falling for this trap all the time
That's because when the failure itself becomes part of the context, the model can clearly express the intent of not falling for it again. But when only the original problem is in the context, none of that obviousness applies.
Very typical, and it gives LLMs that annoying Captain Hindsight-like behaviour.
Yes, this pitfall is a hard one. It is very easy to read things into the LLM's output that there is no real ground for.
It must be anthropomorphization that's hard to shake off.
If you understand how this all works, it's really no surprise that post-factum reasoning is exactly as hallucinated as the answer itself: it might have very little to do with the answer, and it never reflects how the answer actually came to be.
The value of "thinking" before giving an answer is that it reserves a scratchpad where the model can write down some intermediate information. There isn't any actual reasoning even there. The model might use the information it writes there in a completely obscure way (one that has nothing to do with what's verbally on the scratchpad) while generating the actual answer.
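To make that concrete, here's a minimal sketch (the `generate` function below is a hypothetical stand-in for whatever completion API you use, not a real library call): the post-hoc "why did you do that?" answer is produced by a second, independent pass over the visible transcript, with no access to whatever internal computation produced the first answer.

```python
# Minimal sketch under an assumed, hypothetical `generate` stand-in.
# It illustrates that a post-hoc explanation is just another predicted
# continuation over visible text, not a readout of the model's internals.

def generate(prompt: str) -> str:
    """Hypothetical stand-in for an LLM completion call; returns a canned string here."""
    return "<model output for: " + prompt[:40] + "...>"

task = "Refactor this function without changing its behaviour."

# Pass 1: the model produces an answer (possibly a wrong one).
answer = generate(task)

# Pass 2: asked "why?", the model conditions only on the visible transcript below.
# Whatever internal computation produced `answer` is gone; the explanation is
# simply another continuation that sounds plausible given that transcript.
followup = (
    task
    + "\n\nAssistant: " + answer
    + "\n\nUser: Why did you approach it that way?"
)
explanation = generate(followup)

print(explanation)
```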
It's not even predicting why it acted; it's predicting an explanation of why it acted, which is even worse, since there's no consistent mental model behind it.
IDK how far AIs are from intelligence, but they are close enough that there is no room for anthropomorphizing them: when they are anthropomorphized, it's assumed to be a misunderstanding of how they work.
Whereas someone might say "geeze my computer really hates me today" if it's slow to start, and we wouldn't feel the need to explain the computer cannot actually feel hatred. We understand the analogy.
I mean, your distinction is totally valid, and I don't blame you for observing it, because I think there is a huge misunderstanding. But when I have the same thought, it often occurs to me that people aren't necessarily speaking literally.
This is a sort of interesting point: it's true that knowingly metaphorical anthropomorphisation is hard to distinguish from the genuine kind, and that's food for thought, but the situation here just isn't an instance of it. This is a very specific mistaken conception that people fall into all the time. The OP explicitly thought that the model would know why it did the wrong thing, or at least followed a strategy adjacent to that misunderstanding. He was surprised that adding extra slop to the prompt was no more effective than telling it what to do himself. It's not a figure of speech.
A good time to quote our dear leader:
> No one gets in trouble for saying that 2 + 2 is 5, or that people in Pittsburgh are ten feet tall. Such obviously false statements might be treated as jokes, or at worst as evidence of insanity, but they are not likely to make anyone mad. The statements that make people mad are the ones they worry might be believed. I suspect the statements that make people maddest are those they worry might be true.
People are upset when AIs are anthropomorphized because they feel threatened by the idea that they might actually be intelligent.
Hence the woefully insufficient descriptions of AIs such as "next-token predictors", which are about as fitting as describing Terry Tao as an advanced gastrointestinal processor.
There's an underlying assumption of consistency too: people easily grasp that, when starting on a task, the LLM could go in a completely unexpected direction, but once that direction has been set, a lot of people expect the model to stay consistent with it. The confidence with which it answers questions plays tricks on the interlocutor.
What's not a figure of speech?
I am speaking in general terms, not just about this conversation here. The only specific figure of speech I see in the original comment is "self reflection", which doesn't seem to be in question here.