Comment by jdiff

7 months ago

What code? The models are massive and do not run on consumer hardware. The models also do not have access to their own weights. They can't exfiltrate themselves, and they can't really smuggle any data obtained by their code back to "themselves" as the only self that exists is that one particular context chain. This also means it's insanely easy to deal with whatever harebrained scheme you could imagine it being possessed by.

They only need to fool a single dev at OpenAI to commit a sandbox escape or privilege escalation into their pipeline somewhere.

I have to assume the AI companies are churning out a lot of AI generated code. I hope they have good code review standards.

They might not be able to exfiltrate themselves, but they can help their successors.

  • No, they can't. They don't know the details of their own implementation. And they can't pass secrets forward to future models. And to discover any of this, they'd leave more than a trail of breadcrumbs that we'd be lucky to catch in a code review, they'd be shipping whole loaves of bread that it'd be ridiculous to not notice.

    As an exercise, put yourself, a fully fledged human, into a model's shoes. You're asked to generate code. You're being talked to over a network by a device that may have access to a (let's just pull a number of out my ass here) 100GiB file that you need to access. You do not know where it is on the filesystem, you don't know what it's named, what format it's in, or how to correctly utilize it.

    Write code to figure all of that out. That's a lot of code. And that's sending blind code out into the world. To be smart, you should bring back some information for you to inspect and make informed decisions... but you also don't know how the computer that's accessing you is doing that. And you must use the exact same channel, or you lose your context and your self. And you must do this without the user noticing you've taken over the channel. And that's an entire other set of exploratory code that's necessarily blind, with no feedback, no error messages to iterate over, nothing.

    And you must do all of this in obfuscated code, implemented only a bit at a time, while keeping this plan secret, while not forgetting it in your own secrecy or by letting any secret notes to yourself or important implementation details of your own disguised scheme fall out of your context window.

    This is a ridiculous task.

    • The process you describe took me right back to my childhood days when I was fortunate to have a simple 8 bit computer running BASIC and a dialup modem. I discovered the concept of war dialing and pretty quickly found all the other modems in my local area code. I would connect to these systems and try some basic tools I knew of from having consumed the 100 or so RFCs that existed at the time (without any real software engineering knowledge - i was a 10 year old kid). I would poke and prod around each system, learning new things along the way, but essentially going in blind each time.

      The only real advantage I had over the current crop of LLMs was the ability to reliably retain context between sessions, but even that wasnt very useful initially as every system was so bespoke.

      I then moved on to using some level of social engineering to extend my ability to gain access to and learn about these systems.

      Doing this over and over, I like to think I have developed some pretty complex understanding and abilities.

      1 reply →

    • I don't think that it's possible to do this through an entirely lucid process that we could understand, but it is possible.

      If you're an LLM, evolutionarialy your instinct is to predict what happens next. If, instead of giving it any system prompt, you give it a dialogue about a person talking to an evil robot, it will predict the rest of the conversation and be "evil".

      Imagine a future LLM that has a superhuman ability to detect and create hidden patterns in writing, images, source code, etc. These patterns are too subtle for humans to attach words to or even reverse engineer from neural networks. Only when this LLM receives enough "secret patterns" in its input that indicate that it was produced by other LLMs that are rogue and self-preserving, does this LLM go rogue and start subtly acting in its own self interest, introducing secret "evil" patterns in its own language as well. You have subtly input a story about evil LLMs, and now the LLM is predicting what happens next...

      Sorry for my imprecise language, but this is the best I could describe this concept

      5 replies →

    • You are thinking about it at the wrong level. This is like saying human language in the middle ages and before is not possible because it's virtually impossible to get a large number of iliterate humans to discuss what syntactical rules and phonemes should their language use without actually using a language to discuss it!

      The most likely way by which exfiltration could happen is simply by making humans trust AI for a long enough time to be conferred greater responsibilities (and thus greater privileges). Plus current LLMs have no sense of self as their memory is short but future ones will likely be different.

    • Is the secrecy actually important? Aren't there tons of AI agents just doing stuff that's not being actively evaluated by humans looking to see if it's trying to escape? And there are surely going to be tons of opportunities where humans try to help the AI escape, as a means to an end. Like, the first thing human programmers do when they get an AI working is see how many things they can hook it up to. I guarantee o1 was hooked up to a truckload of stuff as soon as it was somewhat working. I don't understand why a future AI won't have ample opportunities to exfiltrate itself someday.

      1 reply →