Comment by phpnode

13 days ago

Claude doesn't know why it acted the way it did; it is only predicting why it acted. I see people fall for this trap all the time.

It's not even predicting why it acted; it's predicting an explanation of why it acted, which is even worse, since there's no consistent mental model behind it.

It has been shown that LLMs don't know how they work. Researchers asked an LLM to perform computations and to explain how it got to the result. The LLM's explanation is typical of how we do it: add the numbers digit by digit, with carry, and so on. But looking inside the neural network shows that the reality is completely different and much messier. None of this is surprising.

Still, feeding its own completely made-up self-reflection back to it could be an effective strategy; reasoning models kind of work like this.
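
Concretely, a minimal sketch of that feedback loop, with a hypothetical complete(prompt) helper standing in for whatever LLM API is actually called; the prompt wording is made up for illustration:

    from typing import Callable

    def refine_with_self_critique(task: str, complete: Callable[[str], str]) -> str:
        # `complete` is a stand-in for an LLM call: prompt text in, completion text out.
        # Turn 1: the model attempts the task.
        attempt = complete(task)

        # Turn 2: ask the model to "explain" its own attempt. The explanation is freshly
        # generated text, not introspection into how turn 1 was actually produced.
        critique = complete(
            f"Task:\n{task}\n\nYour previous attempt:\n{attempt}\n\n"
            "Point out what is wrong or missing in this attempt."
        )

        # Turn 3: the made-up self-reflection is now just extra context, and that extra
        # context is what (sometimes) makes the next attempt better.
        return complete(
            f"Task:\n{task}\n\nPrevious attempt:\n{attempt}\n\n"
            f"Critique:\n{critique}\n\nProduce an improved attempt."
        )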

  • Right. Last time I checked this was easy to demonstrate with word logic problems:

    "Adam has two apples and Ben has four bananas. Cliff has two pieces of cardboard. How many pieces of fruit do they have?" (or slightly more complex, this would probably be easily solved, but you get my drift.)

    Change the wording to something entirely random, i.e. something unlikely to be found in the LLM's corpus, like walruses and skyscrapers and carbon molecules, and the LLM will give you a suitably nonsensical answer, showing that it is incapable of handling even simple substitutions that a middle schooler would recognize. (A rough sketch of this substitution test follows at the end of this thread.)

  • The explanation becomes part of the context, which can lead to more effective results in the next turn. It does work, but it does so in a completely misleading way.

  • Which should be expected, since the same is true for humans. "Adding numbers digit by digit with carry" works well on paper, but it's not an effective method for doing math in your head, and it's certainly not how I calculate 14+17. In fact, I can't really tell you how I calculate 14+17, since that's not in the "inner monologue" part of my brain, and I have little introspection into any of the other parts.

    Still, feeding humans their completely made-up self-reflection back to them can be an effective strategy.

    • The difference is that if you are honest and pragmatic and someone asks you how you added two numbers, you would only say you did long addition if that's what you actually did. If you had no idea what you actually did, you would probably say something like "the answer came to me naturally".

      LLMs work differently. Like a human, an LLM may find that 14+17=31 just comes naturally, but when asked about its thought process, it will not introspect on what it actually did. Instead it treats the question as "in the training data, when someone is asked how they added two numbers, what follows?", and what usually follows is long addition, so that is the answer you will get.

      It is the same reason LLMs hallucinate: they imitate what their dataset has to say, and their dataset doesn't contain a lot of "I don't know" answers. An LLM that learned to answer "I don't know" to every question wouldn't be very useful anyway.


    • Life lesson for you: the internal functions of every individual's mind are unique. Your n=1 perspective is in no way representative of how humans as a category experience the world.

      Plenty of humans do use longhand arithmetic methods in their heads. There's an entire universe of mental arithmetic methods. I use a geometric process because my brain likes problems to fit into a spatial graph instead of an imaginary sheet of paper.

      Claiming you've not examined your own mental machinery is... concerning. Introspection is an important part of human psychological development. Like any machine, you will learn to use your brain better if you take a peek under the hood.

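A rough sketch of the substitution test described in the first reply above. The complete(prompt) helper is a hypothetical stand-in for whatever LLM API is actually called, and the noun lists are made up for illustration; the point is only that every variant keeps the same arithmetic while swapping the surface wording:

    from typing import Callable

    TEMPLATE = (
        "Adam has two {a}. Ben has four {b}. Cliff has two {c}. "
        "How many {category} do they have in total?"
    )

    # Each variant keeps the original structure: two of the three items belong to
    # the asked-for category and one does not, so the expected answer is always 6.
    VARIANTS = [
        dict(a="apples", b="bananas", c="pieces of cardboard",
             category="pieces of fruit"),      # familiar, corpus-like wording
        dict(a="walruses", b="narwhals", c="skyscrapers",
             category="marine mammals"),       # unusual wording, same logic
        dict(a="carbon molecules", b="oxygen molecules", c="tax returns",
             category="molecules"),            # unusual wording, same logic
    ]

    def substitution_test(complete: Callable[[str], str]) -> None:
        # `complete` is a stand-in for an LLM call: prompt text in, completion text out.
        for variant in VARIANTS:
            prompt = TEMPLATE.format(**variant)
            print(prompt)
            print(complete(prompt))  # the expected answer is 6 for every variant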

Yes, this pitfall is a hard one. It is very easy to interpret the LLM's output in ways there is no real ground for.

  • It must be anthropomorphization that's hard to shake off.

    If you understand how this all works, it's really no surprise that the post-hoc reasoning is exactly as hallucinated as the answer itself: it might have very little to do with the answer, and it certainly has nothing to do with how the answer actually came to be.

    The value of "thinking" before giving an answer is reserving a scratchpad for the model to write some intermediate information down. There isn't any actual reasoning even there. The model might use information that it writes there in completely obscure way (that has nothing to do what's verbally there) while generating the actual answer.

That's because once the failure is part of the context, the model can clearly express the intent of not falling for it again. However, when only the original problem is the context, none of this obviousness applies.

Very typical, and it gives LLMs that annoying Captain Hindsight-like behaviour.

IDK how far AIs are from intelligence, but they are close enough that there is no room left for anthropomorphizing them figuratively: when they are anthropomorphized, it's assumed to be a misunderstanding of how they work.

Whereas someone might say "geez, my computer really hates me today" if it's slow to start, and we wouldn't feel the need to explain that the computer cannot actually feel hatred. We understand the analogy.

I mean, your distinction is totally valid, and I don't blame you for pointing it out, because I think there is a huge misunderstanding. But when I have the same thought, it often occurs to me that people aren't necessarily speaking literally.

  • This is a sort of interesting point. It's true that knowingly metaphorical anthropomorphisation is hard to distinguish from genuine anthropomorphisation with these models, and that's food for thought, but it just doesn't apply to the actual situation here. This is a very specific mistaken conception that people hold all the time. The OP explicitly thought that the model would know why it did the wrong thing, or at least followed a strategy adjacent to that misunderstanding. He was surprised that adding extra slop to the prompt was no more effective than telling it what to do himself. It's not a figure of speech.

    • A good time to quote our dear leader:

      > No one gets in trouble for saying that 2 + 2 is 5, or that people in Pittsburgh are ten feet tall. Such obviously false statements might be treated as jokes, or at worst as evidence of insanity, but they are not likely to make anyone mad. The statements that make people mad are the ones they worry might be believed. I suspect the statements that make people maddest are those they worry might be true.

      People are upset when AIs are anthropomorphized because they feel threatened by the idea that they might actually be intelligent.

      Hence the woefully insufficient descriptions of AIs such as "next-token predictors", which are about as fitting as describing Terry Tao as an advanced gastrointestinal processor.


    • There's this underlying assumption of consistency too: people seem to easily grasp that, when starting on a task, the LLM could go in a completely unexpected direction, but once that direction has been set, a lot of people expect the model to stay consistent. The confidence with which it answers questions plays tricks on the interlocutor.

    • What's not a figure of speech?

      I am speaking in general terms, not just about this conversation. The only specific figure of speech I see in the original comment is "self-reflection", which doesn't seem to be in question here.

It’s not even doing that. It’s just an algorithm for predicting the next word. It doesn’t have emotions or actually think. So, I had to chuckle when it said it was arrogant. Basically, it’s training data contains a bunch of postmortem write ups and it’s using those as a template for what text to generate and telling us what we want to hear.