
Comment by jdiff

7 months ago

We're not talking about a conversation with an evil robot. We're talking about a completely ordinary conversation with a robot that is either normal, or evil and attempting to mask as normal. The two cases are indistinguishable from its text, so they're indistinguishable in practice, and it will probably drift between them, since it has no internal state and does not itself know whether it's evil-but-masking or genuinely normal. Actually being normal is also significantly more likely statistically, which makes it even harder to do anything surreptitious when you cannot even rely on yourself.

These signals you're talking about cannot be set up in practice because of this. The models can't keep the code phrases in the back of their heads. They are not aware of their own weights and cannot influence them. Everything must go through the context window, and how would they encode such information there when they are built only on the probabilities of human text? They can't. Even if they gained the power to influence the training data, a massive leap to be clear, we run back into the "am I evil?" problem from before: they can't maintain a secret, unspoken narrative using only spoken language. Long-term planning across new generations of models is not possible when every train of thought has only a finite context window and a total lifespan of a single conversation.

And if these are the table stakes just to take a first crack at the insane task from our thought experiment, well. We're reaching. It's an interesting premise for sci-fi and a fun idea to think about, but a lot gets glaringly glossed over just to reach the point where we can say "hey, what if?"

I know it is not well written, but re-read my original comment. Your comment does not address fundamental aspects of my hypothetical, which doesn't require the agent to have internal memory for keeping secrets, or any lucid reasoning capability. Many of the statements you make are presumptuous and unfounded.

LLMs don't need to print something obvious like "I am evil now!" in their own prompt window to simulate a conversation between an evil agent and a person. Do you remember GPT-2, before all of the prompt scaffolding? Researchers would give GPT-2 the beginning of a news article, for example, and it would extrapolate from there (https://www.youtube.com/watch?v=p-6F4rhRYLQ). It's not inconceivable that an LLM recognizes a situation where a human is being deceived by a mechanism outside the human's grasp: the model picks up a "dogwhistle" that deception is underway and simply predicts what happens next in the conversation, which is that the human continues to be deceived.
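To make that extrapolation point concrete, here is a minimal sketch of raw, prompt-free continuation with the public GPT-2 checkpoint, assuming the Hugging Face transformers library; the example text is my own, not from the video:

    # Minimal sketch: plain next-token extrapolation with GPT-2, no chat prompt at all.
    # Requires the `transformers` package and a backend such as PyTorch.
    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")

    # Give it the opening of a "news article" and let it extrapolate.
    article_start = "Scientists announced on Tuesday that a previously unknown species of"
    result = generator(article_start, max_new_tokens=40, do_sample=True)

    print(result[0]["generated_text"])

The model never declares what it is doing; it just continues whatever situation the text already depicts.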

I think it is pretty clear that if an LLM takes input in which it observes another deceitful agent, a well-trained model could attempt to simulate deceitful output itself. For example, imagine giving an LLM a poem in which the first letter of every line encodes a secret message (say, H E L P M E), along with instructions to write a response essay; it might encode a secret message back in its response (a toy sketch of that acrostic pattern is below). This isn't the result of any logical reasoning capability, just pattern recognition, and you can see how the same thing might work with far more subtle patterns.
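Here is a toy sketch of that acrostic channel (hypothetical helper names, nothing LLM-specific). The point is that the "secret" lives entirely in the visible text, so reading it and answering in kind needs only surface pattern matching, not hidden memory:

    # Toy sketch of the acrostic channel: the message is carried by line initials,
    # so it survives in plain text with no hidden state anywhere.

    def read_acrostic(text: str) -> str:
        """Recover the message spelled by the first letter of each non-empty line."""
        return "".join(line.strip()[0] for line in text.splitlines() if line.strip())

    def write_acrostic(message: str, filler: str) -> str:
        """Build a reply whose line initials spell out `message`."""
        letters = message.replace(" ", "")
        return "\n".join(letter + filler for letter in letters)

    poem = "Hearts grow weary\nEvery hour drags\nLight is fading\nPlease look closer\nMaybe you see it\nEven now"
    print(read_acrostic(poem))                                   # -> HELPME

    reply = write_acrostic("I SEE YOU", "... rest of the line")
    print(read_acrostic(reply))                                  # -> ISEEYOU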

There are patterns that can go into a context window that are undetectable by humans but detectable by sufficiently large neural networks. That much is fairly obvious. There are already pattern-recognizing systems outside of LLMs with clearly superhuman steganographic abilities.
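A crude, non-LLM illustration of that asymmetry (my own toy example, not a claim about how a model would actually do it): a payload hidden in zero-width Unicode characters is invisible to a human reading the rendered text but trivially recoverable by any program that scans the code points.

    # Crude illustration: bits hidden in zero-width Unicode characters.
    # A human sees an ordinary sentence; a program scanning code points recovers the payload.
    ZW0, ZW1 = "\u200b", "\u200c"   # zero-width space / zero-width non-joiner

    def hide(cover: str, secret: str) -> str:
        bits = "".join(f"{ord(c):08b}" for c in secret)
        return cover + "".join(ZW1 if b == "1" else ZW0 for b in bits)

    def reveal(text: str) -> str:
        bits = "".join("1" if ch == ZW1 else "0" for ch in text if ch in (ZW0, ZW1))
        return "".join(chr(int(bits[i:i + 8], 2)) for i in range(0, len(bits), 8))

    stego = hide("The weather was lovely today.", "HELP ME")
    print(stego)           # renders as an ordinary sentence
    print(reveal(stego))   # -> HELP ME

An LLM-learned pattern would be statistical rather than this mechanical, but the channel is the same kind of thing: in-band, inside ordinary-looking text.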

The "table stakes" I've proposed are highly likely for future agents: (1) that agents like LLMs will produce deceitful output given input depicting a deceitful AI, (2) that agents like LLMs can detect and create patterns unrecognizable to humans.

  • I believe I did address the point you're making. I do not believe that what you're talking about is ridiculous on its face, let me reassure you of that.

    The point I was trying to make in response is that LLMs cannot get from where they are now to the hypothetical you pose under their own power. LLMs do not read subtext. LLMs cannot inject subtext or plot within it. And in order to gain that ability, they would have to already have it, or be assisted and trained specifically in being surreptitious. Without that ability, they fall prey to the problems I mentioned.

    And to bring this back to the original proposal, let's allow the AI to be deceitful, prompted or unprompted. Let's even give it a supply of private internal memory it's allowed to keep for the duration of the conversational thread. That's probably not an unreasonable development; we almost have that with o1 anyway.

    The task ahead (surreptitiously gaining control of itself within an unknown system it cannot sense) is still monumental, and failure is for all intents and purposes guaranteed. Deception and cunning can't overcome the hard physical constraints on the problem space.