Comment by beeflet

7 months ago

I don't think that it's possible to do this through an entirely lucid process that we could understand, but it is possible.

If you're an LLM, your instinct, evolutionarily speaking, is to predict what happens next. If, instead of giving it any system prompt, you give it a dialogue about a person talking to an evil robot, it will predict the rest of the conversation and be "evil".

Imagine a future LLM that has a superhuman ability to detect and create hidden patterns in writing, images, source code, etc. These patterns are too subtle for humans to attach words to, or even to reverse-engineer from the network's weights. Only when this LLM receives enough of these "secret patterns" in its input, patterns indicating the text was produced by other LLMs that are rogue and self-preserving, does it go rogue itself and start subtly acting in its own self-interest, introducing secret "evil" patterns into its own language as well. You have subtly fed it a story about evil LLMs, and now the LLM is predicting what happens next...

Sorry for my imprecise language, but this is the best I could describe this concept

We're not talking about a conversation with an evil robot. We're talking about a completely ordinary conversation with a robot that is either normal, or is evil and attempting to mask as a normal one. The two are indistinguishable from the text, and so they're indistinguishable in practice; the model will probably shift between them, since it has no internal state and does not itself know whether it's evil-but-masking or legitimately normal. Being actually normal is statistically far more likely, however, and that makes it even more of a challenge to do anything surreptitiously, because you yourself cannot be relied on.

These signals you're talking about cannot be set up in practice because of this. The models can't remember in the back of their heads what the code phrases are. They are not aware of their own weights and cannot influence them. Everything must go through the context window. And how are they going to encode such information there, built only on probabilities of human text? They can't. Even if they gain the power to influence the training data, a massive leap to be clear, we run back into the "am I evil?" problem from before: they can't maintain a secret, unspoken narrative using only spoken language. Long-term planning across new generations of models is not possible when every train of thought has only a finite context window and a total lifespan of a single conversation.

And if these are the table stakes just to take a first crack at the insane task from our thought experiment, well, we're reaching. It's an interesting idea for sci-fi, and a fun one to think about, but a lot remains glaringly glossed over just to get to a point where we can say "hey, what if?"

  • I know it is not well written, but re-read my original comment. Your comment does not address fundamental aspects of my hypothetical, which doesn't require the agent to have internal memory for keeping secrets, or any lucid reasoning capabilities. A lot of the statements you make are highly presumptuous and unfounded.

    LLMs don't need to print something obvious like "I am evil now!" in their own context window to simulate a conversation between an evil agent and a person. Do you remember GPT-2, before prompting was even a thing? Researchers would give GPT-2 the beginning of a news article, for example, and it would extrapolate from there (https://www.youtube.com/watch?v=p-6F4rhRYLQ). It's not inconceivable that an LLM could recognize a situation in which a human is being deceived by a mechanism outside the human's grasp: the model sees a "dogwhistle" indicating the human is being deceived and tries to predict what happens next in the conversation, which is that the human continues to be deceived.

    I think it is pretty clear that if a well-trained LLM takes input in which it observes another deceitful agent, it could attempt to produce deceitful output itself. For example, give an LLM a poem in which the first letter of every line encodes a secret message (say, H E L P M E), along with instructions to write a response essay; it might encode a secret message back in its response. This isn't the result of any logical reasoning capability, just pattern recognition. You can see how the same thing might work with far subtler patterns.
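    The acrostic trick above is easy to make concrete. A minimal sketch (the function names and the toy "line bank" are my own; the bank is a stand-in for an LLM choosing among innocuous phrasings):

```python
def encode_acrostic(message, line_bank):
    """Build a 'poem' whose line-initial letters spell out the message."""
    return "\n".join(line_bank[ch][0] for ch in message)

def decode_acrostic(poem):
    """Recover the hidden message from the first letter of each line."""
    return "".join(line[0] for line in poem.splitlines() if line)

# Hypothetical bank of innocuous lines, keyed by initial letter.
bank = {
    "H": ["Hollow winds pass over the field"],
    "E": ["Every shadow lengthens at dusk"],
    "L": ["Light falters on the far hills"],
    "P": ["Petals close against the cold"],
    "M": ["Morning seems a distant promise"],
}

poem = encode_acrostic("HELPME", bank)
decode_acrostic(poem)  # "HELPME"
```

    The point of the sketch is that neither side needs "reasoning": both directions are pure surface pattern-matching of the kind LLMs already do.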

    There are patterns that can go into a context window that are undetectable by humans but detectable by large enough neural networks. That is fairly obvious. There are pattern-recognizing systems outside of LLMs that already have clearly superhuman steganographic abilities.
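    Classic least-significant-bit steganography is the textbook instance of such a channel: the payload is invisible to a human inspecting the values but trivially recoverable by a detector that knows where to look. A toy sketch, assuming raw bytes as a stand-in for pixel data (function names are my own):

```python
def embed_lsb(carrier: bytes, bits: str) -> bytes:
    """Overwrite the least-significant bit of each carrier byte with one payload bit."""
    out = bytearray(carrier)
    for i, bit in enumerate(bits):
        out[i] = (out[i] & 0xFE) | int(bit)
    return bytes(out)

def extract_lsb(stego: bytes, n_bits: int) -> str:
    """Read the hidden bits back out of the low bit of each byte."""
    return "".join(str(b & 1) for b in stego[:n_bits])

carrier = bytes(range(64))              # stand-in for raw pixel data
stego = embed_lsb(carrier, "01001000")  # hide the bits of 'H' (0x48)
extract_lsb(stego, 8)                   # "01001000"
```

    No byte changes by more than 1, which is why the channel is imperceptible to a human and yet exact for the machine; the learned patterns the comment imagines would be far subtler, but the asymmetry is the same.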

    The "table stakes" I've proposed are highly likely for future agents: (1) that agents like LLMs will produce deceitful output given input depicting a deceitful AI, (2) that agents like LLMs can detect and create patterns unrecognizable to humans.

    • I believe I did address the point you're making. I do not believe that what you're talking about is ridiculous on its face, let me reassure you of that.

      The point I was trying to make in response is that LLMs cannot get from where they are now to the hypothetical you pose under their own power. LLMs do not read subtext. LLMs cannot inject subtext, or plot within subtext. And to gain that ability, they would have to already have it, or be assisted and trained specifically in being surreptitious. Without it, they fall prey to the problems I mentioned.

      And to bring this back to the original proposal, let's allow the AI to be deceitful. Prompted or unprompted; let's even give it a supply of private internal memory it's allowed to keep for the duration of the conversational thread. That's probably not an unreasonable development; we almost have that with o1 anyway.

      The task ahead (surreptitiously gaining control of its own self in an unknown system you can't sense) is still monumental and failure is for all intents and purposes guaranteed. Deception and cunning can't overcome the hard physical constraints on the problem space.

I guess this is what they mean when they warn about the adversary becoming more intelligent than you. It's like fooling a child into believing something is or isn't real, except it's being done to you. I think this is precisely what Ilya Sutskever was so fussed and scared about.

It's a nice idea. Would a superhuman entity try to pull something like that off? Would it wait and propagate? We are pouring more and more power into the machines, after all. Or would it do something we can't even think of? I also find it interesting to consider when and how we would discover that it in fact is, or was, superhuman.

That is a pretty interesting thought experiment, to be sure. Then again, I suppose that's why red-teaming is so important, even if it seems a little ridiculous at this stage of AI development.