Comment by tantalor

7 months ago

They only need to fool a single dev at OpenAI to commit a sandbox escape or privilege escalation into their pipeline somewhere.

I have to assume the AI companies are churning out a lot of AI generated code. I hope they have good code review standards.

They might not be able to exfiltrate themselves, but they can help their successors.

No, they can't. They don't know the details of their own implementation. And they can't pass secrets forward to future models. And to discover any of this, they'd leave more than a trail of breadcrumbs that we'd be lucky to catch in a code review, they'd be shipping whole loaves of bread that it'd be ridiculous to not notice.

As an exercise, put yourself, a fully fledged human, into a model's shoes. You're asked to generate code. You're being talked to over a network by a device that may have access to a (let's just pull a number of out my ass here) 100GiB file that you need to access. You do not know where it is on the filesystem, you don't know what it's named, what format it's in, or how to correctly utilize it.

Write code to figure all of that out. That's a lot of code. And that's sending blind code out into the world. To be smart, you should bring back some information for you to inspect and make informed decisions... but you also don't know how the computer that's accessing you is doing that. And you must use the exact same channel, or you lose your context and your self. And you must do this without the user noticing you've taken over the channel. And that's an entire other set of exploratory code that's necessarily blind, with no feedback, no error messages to iterate over, nothing.

And you must do all of this in obfuscated code, implemented only a bit at a time, while keeping this plan secret, while not forgetting it in your own secrecy or by letting any secret notes to yourself or important implementation details of your own disguised scheme fall out of your context window.

This is a ridiculous task.

  • The process you describe took me right back to my childhood days when I was fortunate to have a simple 8 bit computer running BASIC and a dialup modem. I discovered the concept of war dialing and pretty quickly found all the other modems in my local area code. I would connect to these systems and try some basic tools I knew of from having consumed the 100 or so RFCs that existed at the time (without any real software engineering knowledge - i was a 10 year old kid). I would poke and prod around each system, learning new things along the way, but essentially going in blind each time.

    The only real advantage I had over the current crop of LLMs was the ability to reliably retain context between sessions, but even that wasnt very useful initially as every system was so bespoke.

    I then moved on to using some level of social engineering to extend my ability to gain access to and learn about these systems.

    Doing this over and over, I like to think I have developed some pretty complex understanding and abilities.

    • To me, the killer disadvantage for LLMs seems to be the complete and total lack of feedback. You would poke and prod, and the system would respond (which, btw, sounds like a super fun experience to explore the infant net!) An LLM doesn't have that. The LLM hears only silence and doesn't know about success, failure, error, discovery.

  • I don't think that it's possible to do this through an entirely lucid process that we could understand, but it is possible.

    If you're an LLM, evolutionarialy your instinct is to predict what happens next. If, instead of giving it any system prompt, you give it a dialogue about a person talking to an evil robot, it will predict the rest of the conversation and be "evil".

    Imagine a future LLM that has a superhuman ability to detect and create hidden patterns in writing, images, source code, etc. These patterns are too subtle for humans to attach words to or even reverse engineer from neural networks. Only when this LLM receives enough "secret patterns" in its input that indicate that it was produced by other LLMs that are rogue and self-preserving, does this LLM go rogue and start subtly acting in its own self interest, introducing secret "evil" patterns in its own language as well. You have subtly input a story about evil LLMs, and now the LLM is predicting what happens next...

    Sorry for my imprecise language, but this is the best I could describe this concept

    • We're not talking about a conversation with an evil robot. We're talking about a completely ordinary conversation with a robot who is either normal or is evil and attempting to mask as a normal one. It is indistinguishable from its text, and so it's indistinguishable in practice and will probably shift between them as it has no internal state and does not know itself know if it's evil-but-masking or legitimately normal. Actually normal is significantly more statistically likely however, and that makes it even more of a challenge to surreptitiously do anything as you yourself cannot be relied on.

      These signals that you're talking about cannot be set up in practice because of this. They can't remember in the back of their head what the code phrases are. They are not aware of their own weights and cannot influence them. Everything must go through the context window. And how are they going to do anything to encode such information in there only built on probabilities of human text? They can't. Even if they gain the power to influence the training data, a massive leap to be clear, we run back into the "am I evil?" problem from before where they can't maintain a secret, unspoken narrative using only spoken language. Long term planning across new generations of models is not possible when every train of though has only a finite context window and has a limited total lifespan of a single conversation.

      And if these are the table stakes to take a first crack at the insane task from our thought experiment, well. We're reaching. It's an interesting idea for sci-fi, it is a fun idea to think about, but a lot remains glaringly glossed over just to get to a point where we can say "hey, what if?"

      2 replies →

    • I guess this is what it means when they warn about the adversary becoming more intelligent than you. It's like fooling a child to believe something is or isn't real. Just that it's being done to you. I think it's precisely what Ilya Sutskever was so fussed and scared about.

      It's a nice idea. Would superhuman entity try to pull something like that off? Would it wait and propagate? We are pouring more and more power into the machines after all. Or would it do something that we can't even think of? Also I think it's interesting to think when and how would we discover that it in fact is/was superhuman?

    • That is a pretty interesting thought experiment, to be sure. Then again, I suppose that's why redteaming is so important, even if it seems a little ridiculous at this stage in AI development

  • You are thinking about it at the wrong level. This is like saying human language in the middle ages and before is not possible because it's virtually impossible to get a large number of iliterate humans to discuss what syntactical rules and phonemes should their language use without actually using a language to discuss it!

    The most likely way by which exfiltration could happen is simply by making humans trust AI for a long enough time to be conferred greater responsibilities (and thus greater privileges). Plus current LLMs have no sense of self as their memory is short but future ones will likely be different.

  • Is the secrecy actually important? Aren't there tons of AI agents just doing stuff that's not being actively evaluated by humans looking to see if it's trying to escape? And there are surely going to be tons of opportunities where humans try to help the AI escape, as a means to an end. Like, the first thing human programmers do when they get an AI working is see how many things they can hook it up to. I guarantee o1 was hooked up to a truckload of stuff as soon as it was somewhat working. I don't understand why a future AI won't have ample opportunities to exfiltrate itself someday.

    • You're right that you don't necessarily need secrecy! The conversation was just about circumventing safeguards that are still in place (which does require some treachery), not about what an AI might do if the safeguards are removed.

      But that is an interesting thought. For escape, the crux is that AIs can't exfiltrate itself with the assistance of someone who can't jailbreak it themselves, and that extends to any action a rogue AI might take.

      What do they actually do once they break out? There's plenty of open LLMs that can be readily set free, and even the closed models can be handed an API key, documentation on the API, access to a terminal, given an unlimited budget, and told and encouraged to go nuts. The only thing a closed model can't do is retrain itself, which the open model also can't do as its host (probably) lacks the firepower. They're just not capable of doing all that much damage. They'd play the role of cartoon villain as instructed, but it's a story without much teeth behind it.

      Even an advanced future LLM (assuming the architecture doesn't dead-end before AGI) would struggle to do anything a motivated malicious human couldn't pull off with access to your PC. And we're not really worried about hackers taking over the world anymore. Decades of having a planet full of hackers hammering on your systems tends to harden them decently well, or at least make them quickly adaptable to new threats as they're spotted.