Comment by Terr_

16 hours ago

> He’s not necessarily anthropomorphizing it, he’s showing that it went against every instruction he gave it.

It's deeper than that: there are two pitfalls here which are not simply poetic license.

1. When you submit the text "Why did you do that?", what you want is for it to reveal hidden internal data that was causal in the past event. It can't do that; what you'll get instead is plausible text that "fits" at the end of the current document.

2. The idea that one can "talk to" the LLM is already anthropomorphizing on a level which isn't OK for this use-case: The LLM is a document-make-bigger machine. It's not the fictional character we perceive as we read the generated documents, not even if they have the same trademarked name. Your text is not a plea to the algorithm, your text is an in-fiction plea from one character to another.

_________________

P.S.: To illustrate, imagine there's this back-and-forth iterative document-growing with an LLM, where I supply text and then hit the "generate more" button:

1. [Supplied] You are Count Dracula. You are in amicable conversation with a human. You are thirsty and there is another delicious human target nearby, as well as a cow. Dracula decides to

2. [Generated] pounce upon the cow and suck it dry.

3. [Supplied] The human asks: "Dude why u choose cow LOL?" and Dracula replies:

4. [Generated] "I confess: I simply prefer the blood of virgins."

What significance does that #4 "confession" have?

Does it reveal a "fact" about the fictional world that was true all along? Does it reveal something about "Dracula's mind" at the moment of step #2? Neither; it's just generating a plausible add-on to the document. At best, we've learned something about a literary archetype that exists as statistics in the training data.
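
In code terms, that back-and-forth is just one string being grown. A minimal sketch of the loop, where complete() is a stand-in for any raw text-completion endpoint (the name, signature, and parameters are illustrative, not a real API):

    # Hypothetical raw-completion client: "here is a document, make it bigger".
    def complete(document: str, max_tokens: int = 40) -> str:
        """Return a plausible continuation of `document` (stand-in, not a real API)."""
        raise NotImplementedError

    document = (
        "You are Count Dracula. You are in amicable conversation with a human. "
        "You are thirsty and there is another delicious human target nearby, "
        "as well as a cow. Dracula decides to"
    )
    document += complete(document)   # step 2: "pounce upon the cow and suck it dry."

    document += (
        ' The human asks: "Dude why u choose cow LOL?" and Dracula replies:'
    )
    document += complete(document)   # step 4: the "confession"

    # Note what is NOT here: no hidden record of why step 2 came out that way.
    # Step 4 is produced from nothing but the visible text accumulated so far.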

I agree with the practical part of this, with two nuances:

The full data of what's in an LLM's "consciousness" is the conversation context. Just because it isn't hidden doesn't mean it can't contain information you've overlooked.

Asking "why did you do that" won't reveal anything new, but it might surface some amount of relevant information (or it hallucinates, it depends which LLM you're using). "Analyse recent context and provide a reasonable hypothesis on what went wrong" might do a bit better. Just be aware that llm hypotheses can still be off quite a bit, and really need to be tested or confirmed in some manner. (preferably not by doing even more damage)

Just because you shouldn't anthropomorphize doesn't mean an English-capable LLM doesn't have a valid answer to an English string; it just means the answer might not be what you expected from a human.
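
A rough sketch of that pattern, purely as illustration; ask_model() and the file name are stand-ins for whichever chat/completion API and session log you actually have:

    def ask_model(prompt: str) -> str:
        """Stand-in for a real chat/completion API call."""
        raise NotImplementedError

    # The visible context is all the "memory" there is to analyse.
    transcript = open("session_so_far.txt").read()

    # Frame it as analysis of a document, not as asking the agent to confess.
    hypothesis = ask_model(
        "Analyse the following session transcript and give a reasonable "
        "hypothesis about what went wrong, citing the parts that support it:\n\n"
        + transcript
    )

    # Treat the answer as a hypothesis to verify, not a revelation of hidden
    # internal state; nothing here has access to whatever produced the mistake.
    print(hypothesis)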

  • > The full data of what's in an LLM's "consciousness" is the conversation context.

    No it's not; see research on hidden states using SAEs and other methods. To be clear, I agree with your second point, though I still believe the top-level OP was reckless and is now doing the businessman's version of throwing the dog under the bus.

    • We might actually be in full agreement. You can't get a faithful replay of these internal states; they're gone at the end of generation. You can only query and re-derive from the visible context. Hence limited (though not zero) utility, depending on model, harness, and prompt.
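
      To put the same point concretely: per-layer hidden states exist and can be inspected during the forward pass (that's what the SAE work probes), but only the emitted text is carried into the next turn. A small sketch with Hugging Face transformers, using gpt2 purely because it's small; the details are illustrative:

          from transformers import AutoModelForCausalLM, AutoTokenizer

          tok = AutoTokenizer.from_pretrained("gpt2")
          model = AutoModelForCausalLM.from_pretrained("gpt2")

          ids = tok("Dracula decides to", return_tensors="pt").input_ids

          # Internal states are available right now, during the forward pass...
          out = model(ids, output_hidden_states=True)
          print(len(out.hidden_states))      # activation tensors for every layer

          # ...but after generation, only the decoded text survives into the
          # "conversation"; the activations are not stored anywhere in it.
          new_ids = model.generate(ids, max_new_tokens=10)
          print(tok.decode(new_ids[0]))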

Why is this getting downvoted? This is exactly what’s going on here. The LLM has no idea why it did what it did. All it has to go on is the content of the session so far. It doesn’t ‘know’ any more than you do. It has no memory of doing anything, only a token file that it’s extending. You could feed that token file into a completely different LLM, ask it the same question, and it would also just make up an answer.

The best answer so far. It describes exactly what was going on. LLM users should read it twice, especially if the "confession" didn't make your brain hurt a bit.

>it's just generating a plausible add-on to the document

A plausible document that follows the alignment done during the training process, along with all the other post-training where an LLM understanding its actions allows it to perform better on the tasks it was trained on.

  • I don't understand what you're trying to say here.

    It sounds like "we know the LLM understood its actions... because it understood its actions when we trained it", which is circular logic.

You don't seem to realize that humans also work this way.

If you ask a human why they did something, the answer is a guess, just like it is for an LLM.

That's because obviously there is no relationship between the mechanisms that do something and the ones that produce an explanation (in both humans and LLMs).

An example of evidence, from Wikipedia's "split brain" article:

The same effect occurs for visual pairs and reasoning. For example, a patient with split brain is shown a picture of a chicken foot and a snowy field in separate visual fields and asked to choose from a list of words the best association with the pictures. The patient would choose a chicken to associate with the chicken foot and a shovel to associate with the snow; however, when asked to reason why the patient chose the shovel, the response would relate to the chicken (e.g. "the shovel is for cleaning out the chicken coop").[4]

  • Most humans don't have split brains, and without split brains you have quite a bit of insight into the thoughts in your brain. It's not perfect, but it's better than nothing. LLMs have nothing, since there is no mechanism for them to communicate forward except the text they read.

    • > Most humans don't have split brains, and without split brains you have quite a bit of insight into the thoughts in your brain. It's not perfect, but it's better than nothing. LLMs have nothing, since there is no mechanism for them to communicate forward except the text they read.

      I can't prove it, but this is almost certainly one of those things that is, uh, less than universal in the population.

  • > humans also work this way.

    I'm aware of the condition, but let's not confuse failure modes with operational modes. A human with leg problems might use a wheelchair, but that doesn't mean you've cracked "human locomotion" by bolting two wheels onto something.

    Also, while both brain-damaged humans and LLMs casually confabulate, I think there's some work to do before one can prove they use the same mechanics.