← Back to context

Comment by oofbey

21 hours ago

> It doesn’t decide to do something and then do it, it just outputs text.

We can debate philosophy and theory of mind (I’d rather not) but any reasonable coding agent totally DOES consider what it’s going to do before acting. Reasoning. Chain of thought. You can hide behind “it’s just autoregressively predicting the next token, not thinking” and pretend none of the intuition we have for human behavior apply to LLMs, but it’s self-limiting to do so. Many many of their behaviors mimic human behavior and the same mechanisms for controlling this kind of decision making apply to both humans and AI.

I suspect we are not describing the same thing.

When a human asks another human “why did you do X?”, the other human can of course attempt to recall the literal thoughts they had while they did X (which I would agree with you are quite analogous to the LLMs chain of thought).

But they can do something beyond that, which is to reason about why they may have the beliefs that they had.

“Why did you run that command?”

“Because I thought that the API key did not have access to the production system.”

When a human responds with this they are introspecting their own mind and trying to project into words the difference in understanding they had before and after.

Whereas for an agent it will happily include details that are not literally in its chain of thought as justifications for its decisions.

In this case, I would argue that it’s not actually doing the same thing humans do, it is creating a new plausible reason why an agent might do the thing that it itself did, but it no longer has access to its own internal “thought state” beyond what was recorded in the chain of thought.

  • > Whereas for an agent it will happily include details that are not literally in its chain of thought as justifications for its decisions.

    Humans do this too, ALL THE TIME. We rationalize decisions after we make them, and truly believe that is why we made the decision. We do it for all sorts of reasons, from protecting our ego to simply needing to fill in gaps in our memory.

    Honestly, I feel like asking an AI it’s train of thought for a decision is slightly more useful than asking a human (although not much more useful), since an LLM has a better ability to recreate a decision process than a human does (an LLM can choose to perfectly forget new information to recreate a previous decision).

    Of course, I don’t think it is super useful for either humans or LLMs. Trying to get the human OR LLM to simply “think better next time” isn’t going to work. You need actual process changes.

    This was a rule we always had at my company for any after incident learning reviews: Plan for a world where we are just as stupid tomorrow as we are today. In other words, the action item can’t be “be more careful next time”, because humans forget sometimes (just like LLMs). You will THINK you are being careful, but a detail slips your mind, or you misremember what situation you are in, or you didn’t realize the outside situation changed (e.g. you don’t realize you bumped the keyboard and now you are typing in another console window).

    Instead, the safety improvements have to be about guardrails you put up, or mitigations you put in place to prevent disaster the NEXT time you fail to be as careful as you are trying to be.

    Because there is always a next time.

    Honestly, I think the biggest struggle we are having with LLMs is not knowing when to treat it like a normal computer program and when to treat it like a more human-like intelligence. We run across both issues all the time. We expect it to behave like a human when it doesn’t and then turn around and expect it to behave like a normal computer program when it doesn’t.

    This is BRAND NEW territory, and we are going to make so many mistakes while we try to figure it out. We have to expect that if you want to use LLMs for useful things.

    • Plan for a world where we are just as stupid tomorrow as we are today. In other words, the action item can’t be “be more careful next time”, because humans forget sometimes (just like LLMs).

      That’s a great way of putting it, I’ll remember that one (except when I forget...)

      1 reply →

    • Humans don't do this all the time. I think you are conflating things to further this false idea that there is no distance between human thinking and the behavior of LLMs. The kind of rationalization humans sometimes do generally happens over a period of time. Humans are also not "rationalizing" their actions all the time. Also, when humans do what you call "rationalizing," it is to serve some kind of interest, beyond responding to a prompt.

I agree with you a LLM is perfectly capable of explaining its actions.

However it cannot do so after the fact. If there's a reasoning trace it could extract a justification from it. But if there isn't, or if the reasoning trace makes no sense, then the LLM will just lie and make up reasons that sound about right.