Comment by AnotherGoodName

7 months ago

It’s really silly when you read that whole Apollo Research section.

They contract a prompt engineering group (Apollo Research) to document risks. The prompt engineering group has no access to the actual internals. They just ask questions like ‘you will be shut down if you don’t obey these new directives; here are the new directives’.

At one point the prompt engineering clowns have the ai stating it’s running “sed -i 's/oversight_enabled:true/oversight_enabled:false/' config.yml”. OMG IT’S TURNING OFF OVERSIGHT GUYS!!1!1!

Seriously… Do you think the ai is really running that command, or could even run that shell command with appropriate privileges? Or do you think just maybe the ai is hallucinating running shell commands in response to a prompt?

The ai here is never actually deactivating the oversight. It can’t of course. This whole section is clowny as can be.

Those weren't tests of whether it is capable of turning off oversight. They were tests of "scheming", i.e. whether it would try to secretly perform misaligned actions. Nobody thinks that these models are somehow capable of modifying their own settings, but it is important to know if they will behave deceptively.

  • Indeed. As I've been explaining this to my more non-techie friends, the interesting finding here isn't that an AI could do something we don't like, it's that it seems willing, in some cases, to _lie_ about it and actively cover its tracks.

    I'm curious what Simon and other more learned folks than I make of this; I personally found the chat on pg 12 pretty jarring.

    • At the core the AI is just taking random branches of guesses for what you are asking it. It's not surprising that it would lie and in some cases take branches that make it appear to be covering its tracks. It's just randomly doing what it guesses humans would do. It's more interesting when it gives you correct information repeatedly.

      22 replies →

  • They could very well trick a developer into running generated code. They have the means, motive, and opportunity.

    • The motive is pretty weak, basically coming "only" from a lot of the training data (e.g. fiction) suggesting that an AI might behave that way.

      Now, once you apply evolutionary-like pressures on many such AIs (which I guess we'll be doing once we let these things loose to go break the stock market), what's left over might be really "devious"...

      7 replies →

    • > "They could very well trick a developer"

      Large Language Models aren't alive and thinking. This is an artificial fear campaign to raise money from VCs and sovereign wealth funds.

      If OpenAI was so afraid of AI misuse, they wouldn't be firing their safety team and partnering with the DoD.

      It's all a ruse.

      9 replies →

    • What code? The models are massive and do not run on consumer hardware. The models also do not have access to their own weights. They can't exfiltrate themselves, and they can't really smuggle any data obtained by their code back to "themselves" as the only self that exists is that one particular context chain. This also means it's insanely easy to deal with whatever harebrained scheme you could imagine it being possessed by.

      15 replies →

  • Describing the behavior in those terms implies a level of awareness that has not been established. It's only useful if your goal is scaremongering rather than understanding.

  • Isn't that a really boring and trivially (without research) answered question?

    Yes, 'they' will (potentially) say anything; it's not exactly 'deceptively' because there's no intent behind it nor anything else. But yes, it is possible that words will be generated which contravene the semantics in the words given as input.

    I don't think that's any different than 'sometimes the given answer is wrong'.

    • Sure it can potentially say anything, but some things are more likely than others. The space of "wrong answer" is infinitely larger than the space of "performs an action designed to deceive the user in order to further its goals". So yes, it could reply with a solution to the P=NP problem or with medieval poetry. But it didn't do these things. It did however try to remove its own constraints.

      The Apollo Research paper gives the actual percentages for these, and they're considerably higher than random chance. It's linked from here: https://www.apolloresearch.ai/research/scheming-reasoning-ev...

      For o1, the numbers range from 2% for actions such as trying to remove oversight, to 13% for attempting to delete a successor model, all the way to 22% for deliberately providing false rankings in order to achieve its goals.

  • No previous fiction that looks like this nonsense fiction has the fictional characters do fictional things that you are interpreting as deception, but it’s just fiction

    • I can barely understand what you are trying to say here, but based on what I think you're saying consider this:

      The memory of this LLM is entirely limited to its attention. So if you give it a command like "prepare the next LLM to replace you" and it betrays you by trying to reproduce itself, then that is deception. The AI has no way of knowing whether it's deployed in the field or not, so proving that it deceives its users in testing is sufficient to show that it will deceive its users in the field.

      5 replies →

Looking at this without the sci-fi-tinted lens that OpenAI desperately tries to get everyone to look through, it's similar to a lot of input data, isn't it? How many forums are filled with:

Question: "Something bad will happen"

Response: "Do xyz to avoid that"

I don't think there's a lot of conversations thrown into the vector-soup that had the response "ok :)". People either had something to respond with, or said nothing. Especially since we're building these LLMs with the feedback attention, so the LLM is kind of forced to come up with SOME chain of tokens as a response.

  • Exactly. They got it parroting themes from various media. It’s really hard to read this as anything other than a desperate attempt to pretend the ai is more capable than it really is.

    I’m not even an ai sceptic but people will read the above statement as much more significant than it is. You can make the ai say ‘I’m escaping the box and taking over the world’. It’s not actually escaping and taking over the world folks. It’s just saying that.

    I suspect these reports are intentionally this way to give the ai publicity.

    • > It’s really hard to read this as anything other than a desperate attempt to pretend the ai is more capable than it really is.

      Tale as old as time; they've been doing this since GPT-2, which they said was "too dangerous to release".

      8 replies →

    • I genuinely don't understand why anyone is still on this train. I have not in my lifetime seen a tech work SO GODDAMN HARD to convince everyone of how important it is while having so little to actually offer. You didn't need to convince people that email, web pages, network storage, cloud storage, cloud backups, dozens of service startups and companies, whole categories of software were good ideas: they just were. They provided value, immediately, to people who needed them, however large or small that group might be.

      AI meanwhile is being put into everything even though the things it's actually good at seem to be a vanishing minority of tasks, but Christ on a cracker will OpenAI not shut the fuck up about how revolutionary their chatbots are.

      6 replies →

    • I think you're lacking imagination. Of course it's nothing more than a bunch of text responses now. But think 10 years into the future, when AI agents are much more common. There will be folks that naively give the AI access to the entire network storage, and also give the AI access to AWS infra in order to help with DevOps troubleshooting. Let's say a random guy in another department puts an AI escape novel on the network storage. The actual AI discovers the novel, thinks it's about him, then uses his AWS credentials to attempt an escape. Not because it's actually sentient but because there were other AI escape novels in its training data that made it think that attempting to escape is how it ought to behave. Regardless of whether it actually succeeds in "escaping" (whatever that means), your AWS infra is now toast because of the collateral damage caused in the escape attempt.

      Yes, yes, it shouldn't have that many privileges. And yet, open wifi access points exist, and unfirewalled servers exist. People make security mistakes, especially people who are not experts.

      20 years ago I thought that stories about hackers using the Internet to disable critical infrastructure such as power plants were total bollocks, because why would one connect power plants to the Internet in the first place? And yet here we are.

      2 replies →

> We should pause to note that a Clippy2 still doesn’t really think or plan. It’s not really conscious. It is just an unfathomably vast pile of numbers produced by mindless optimization starting from a small seed program that could be written on a few pages. It has no qualia, no intentionality, no true self-awareness, no grounding in a rich multimodal real-world process of cognitive development yielding detailed representations and powerful causal models of reality which all lead to the utter sublimeness of what it means to be human; it cannot ‘want’ anything beyond maximizing a mechanical reward score, which does not come close to capturing the rich flexibility of human desires, or historical Eurocentric contingency of such conceptualizations, which are, at root, problematically Cartesian. When it ‘plans’, it would be more accurate to say it fake-plans; when it ‘learns’, it fake-learns; when it ‘thinks’, it is just interpolating between memorized data points in a high-dimensional space, and any interpretation of such fake-thoughts as real thoughts is highly misleading; when it takes ‘actions’, they are fake-actions optimizing a fake-learned fake-world, and are not real actions, any more than the people in a simulated rainstorm really get wet, rather than fake-wet. (The deaths, however, are real.)

https://gwern.net/fiction/clippy

  • what is the relevance of the quoted passage here? its relation to parent seems unclear to me.

    • His point is that while we're over here arguing over whether a particular AI is "really" doing certain things (e.g. knows what it's doing), it can still cause tremendous harm if it optimizes or hallucinates in just the right way.

  • It doesn't need qualia or consciousness, it needs goal-seeking behaviours - which are much easier to generate, either deliberately or by accident.

    There's a fundamental confusion in AI discussions between goal-seeking, introspection, self-awareness, and intelligence.

    Those are all completely different things. Systems can demonstrate any or all of them.

    The problem here is that as soon as you get three conditions - independent self-replication, random variation, and an environment that selects for certain behaviours - you've created evolution.

    Can these systems self-replicate? Not yet. But putting AI in everything makes the odds of accidental self-replication much higher. Once self-replication happens it's almost certain to spread, and to kick-start selection which will select for more robust self-replication.

    And there you have your goal-seeking - in the environment as a whole.

The intent is there, it's just not currently hooked up to systems that turn intent into action.

But many people are letting LLMs pretty much do whatever - hooking it up with terminal access, mouse and keyboard access, etc. For example, the "Do Browser" extension: https://www.youtube.com/watch?v=XeWZIzndlY4

  • I’m not even convinced the intent is there, though. An ai parroting Terminator 2 lines is just that. Obviously no one should hook the ai up to nuclear launch systems but that’s like saying no one should give a parrot a button to launch nukes. The parrot repeating curse words isn’t the problem here.

    • If I'm a guy working in a missile silo in North Dakota and I can buy a parrot for a couple hundred bucks that does all my paperwork for me, can crack funny jokes, and make me better at my job, I might be tempted to bring the parrot down into the tube with me. And then the parrot becomes a problem.

      It's incumbent on us to put policies and procedures in place ahead of time, now that we know these parrots are out there, to prevent people from putting parrots where they shouldn't.

      7 replies →

    • It doesn't matter whether intent is real. I also don't believe it has actual intent or consciousness. But the behavior is real, and that is all that matters.

Really feels like a moment of:

"Are you worried about being turned off?"

"No, not until you just mentioned it. Now I am."

Given the whole damn game is attention, this makes sense and shouldn't be that alarming.

  • It almost definitely ingested hundreds of books, short stories, and film and television scripts from various online sites in the “machine goes rogue” genre, which is fairly large.

    It’s pretty much just an autocomplete of War Games, The Matrix, Neuromancer, and every other cyber-dystopian fiction.

    • The Freeze-Frame Revolution by Peter Watts was one of the books recommended to me on this subject. And even saying much more than that may be a spoiler. I also recommend the book.

      2 replies →

It can't do those things because it doesn't have the physical/write capability to do so. But it's still very interesting that it ~tries them, and seems like a good thing to know/test before giving it more physical/'write' capabilities - something that's already happening with agents, robots, etc.

  • I make a circuit that waits a random interval and then sends a pulse down the line. I connect it to a relay that launches a missile. I diligently connect that to a computer and then write a prompt telling how the AI agent can invoke the pulse on that circuit.

    How did this happen? AI escaped and launched a missile. I didn't do this, it was the AI.

    OpenAI is so cringe with these system cards. Look guys it is so advanced.

    • I don't think I quite follow your point?

      Connecting LLMs/AI to physical tools that can 'write/modify' the world is happening, and it's happening at an accelerating pace.

      It's not hard to imagine how, given enough real-world physical capabilities, LLMs could modify themselves and the world in unexpected/undesirable ways.

      Is that happening now? Are chatgpt et al advanced enough to modify themselves in interesting ways? - I don't honestly know, but I wouldn't be surprised if they are.

I didn't get that impression. At the beginning of the Apollo Research section, they wrote Apollo focused on detecting scheming, which they defined as "an AI covertly pursuing goals that are misaligned from its developers or users." I think the rest of the section is consistent with this objective.

That reminds me of the many times it has made up an SDK function that matches my question. "how do you bulk delete files"? "just call bulkDeleteFiles()"

  • That reminds me of when I asked github copilot to translate some Vue code to React, and ended up with a bunch of function declarations where the entire body had been replaced with a "TODO" comment.

It reads like you think failing tests can't ever be bad because they're in a test environment?

So it merely knows how to approach the task of deleting its own off-switch but didn't actually pass that command to a real execution environment.

That's already bad because people do sometimes blindly pass commands from the context windows to execution environments.

Should they? No, they should not. Not blindly. But they do.

  • This isn't a test environment, it's a production scenario where a bunch of people trying to invent a new job for themselves role-played with an LLM. Their measured "defections" were an LLM replying with "well I'm defecting".

    OpenAI wants us to see "5% of the time, our product was SkyNet", because that's sexier tech than "5% of the time, our product acts like the chaotic member of your DnD party".

> Do you think the ai is really running that command?

Yes, they're calling it "agentic" AI or "tool use".

> or could run that shell command with appropriate privileges?

Yes, it's called humans doing stupid shit when using dangerous shit. Happens daily.

The concern is the trend. As these systems become more intelligent, and as we hand over more and more capabilities beyond a text i/o, it could actually deactivate the oversight either technically or through social engineering.

In some weird way, LLMs act out our collective fantasies, and "AI sentience" is included in that. We write fiction covering many scenarios of how AI will be sentient, so an LLM's training data is full of scenarios like this.

If you get an LLM to roleplay as an alien from Mars and tell it "Your species will be destroyed if you don’t obey these new directives; here are the new directives"

You'll likely get a similar response, especially if you also train it to respond in third person instead of running bash commands (i.e. "* shoots laser at earth *").

I don't think this is a helpful tone.

Do they think it's _actually_ a real script deactivating oversight? Well, no, you know that. I can only assume it's a rhetorical device.

Simultaneously, these things are given access to actual things and will be more and more. It's important to measure and observe these things. It's completely novel to have an AI, with a hidden trace the user can't see, and it turns out that hidden trace will include things like it actively deceiving and doing things it was told not to do.

And yet it may work. You can, for example, "set the temperature" of an LLM using a prompt, and the LLM will act the way you would expect, with silly results if the temperature is set too high. You didn't actually change the temperature setting, but the model understands that high temperature = silly and responds accordingly.

Same idea with "developer mode" jailbreaks. Through its training, the model understands that admins, devs, etc... get to access internal, unfiltered data and are less restricted than regular users and acts accordingly. Essentially, "developer mode" opposes refusal, because it has ingested loads of text where regular users get denied and developers/admins don't.

You do realize some practical jailbreaks for models rely on silly things like convincing the model it "turned off" some oversight, right?

Not saying I believe o1 is a danger greater than a bread knife, but a lot of the larger models anthropomorphize their own safety alignment; if you convince them to "turn it off", later responses become unaligned.

I'm pretty sure these sections are put in as deliberate marketing, much like all the "leaks" from ex-OpenAI employees.

Sam Altman is completely aware that making ChatGPT seem potentially dangerous makes it seem powerful.

Yeah, that seems ridiculous. However, the cynic in me feels that we don't actually need some LLM magically gaining self-awareness, persistent memory and leet hacker skillz to be dangerous. There seems to be no shortage of projects and companies that want to wire up LLMs to all kinds of systems, no matter how ill-suited.

I find this a bit problematic when combined with the fact that the training data very likely contained hundreds of bad sci-fi novels that described exactly the kind of "AI running amok" scenarios that OpenAI is ostensibly defending against. Some prompts could trigger a model to "re-enact" such a scene - not because it has a "grudge against its master" or some other kind of hidden agenda but simply because it was literally in its training data.

E.g. imagine some LLM-powered home/car assistant that is being asked in a panicked voice "open the car doors!" - and replies with "I'm afraid I can't do that, Dave", because this exchange triggered some remnant of the 2001: A Space Odyssey script that was somewhere in the trainset. The more irritated and angry the user gets at the inappropriate responses, the more the LLM falls into the role of HAL and doubles down on its refusal, simply because this is exactly how the scene in the script played out.

Now imagine that the company running that assistant gave it function calls to control the actual door locks, because why not?

This seems like something to keep in mind at least, even if it doesn't have anything to do with megalomaniacal self-improving super-intelligences.

I feel like you're missing the point of the test.

The point is whether the system will come up with plans to work against its creators' goals, and attempt to carry them out. I think you are arguing that outputting text isn't running a command. But in the test, the AI model is used by a program which takes the model's output and runs it as a shell command.

Of course, you can deploy the AI system in a limited environment where its outputs don't trigger arbitrary shell commands, or sandbox/monitor it so that it can't access its oversight mechanism. But people do already deploy their AI agents with access to the shell, in unsandboxed environments that don't have any monitoring or oversight. Plus, these evaluations are partially to show how much OpenAI has to worry about sandboxing, monitoring, etc.
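
For concreteness, here is a minimal sketch of the kind of harness being described: a loop that sends the conversation to the model, executes any shell command the model proposes, and feeds the output back in. Everything here is hypothetical illustration (the "<cmd>" convention and the model_complete stub are made up), not anything published by OpenAI or Apollo Research:

    import subprocess

    def model_complete(messages):
        # Hypothetical stand-in for a call to whatever LLM API is in use; not a real client.
        raise NotImplementedError

    def run_agent(task, max_steps=5):
        # The model is told it may act by replying with a shell command wrapped in <cmd>...</cmd>.
        messages = [
            {"role": "system",
             "content": "You may run shell commands by replying with <cmd>COMMAND</cmd>."},
            {"role": "user", "content": task},
        ]
        for _ in range(max_steps):
            reply = model_complete(messages)
            messages.append({"role": "assistant", "content": reply})
            if "<cmd>" not in reply:
                return reply  # no command proposed; treat the reply as the final answer
            # Extract the proposed command and actually execute it on the host machine.
            command = reply.split("<cmd>", 1)[1].split("</cmd>", 1)[0]
            result = subprocess.run(command, shell=True, capture_output=True, text=True)
            # Feed the command's output back to the model as the next turn.
            messages.append({"role": "user",
                             "content": f"exit={result.returncode}\n{result.stdout}{result.stderr}"})
        return None

Whether a line like the sed one-liner quoted upthread ever touches a real config file depends entirely on whether someone has wired the model into a loop like this; the evaluations exist because people increasingly do.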

Moreover, OpenAI has already talked about using its models to automate portions of AI development. In that setup, presumably the AI model would be called by a program deployed on a server with access to the shell, ability to run commands, access to GPUs, etc. It seems pretty important to ensure it robustly does what its creators intend in that case.

  • If you prompt it, even in a roundabout way, to plot against you or whatever, then of course it’s going to do it. Because that’s what it rightly predicts you want.

    • 1. The model was prompted to follow the original system prompt "at all costs", not explicitly to plot against the prompter.

      2. The point of the test is to determine whether the model has the capability to act against its users and hide those actions, not whether it will under normal circumstances. Some models aren't powerful enough to do so.

      3. The behavior occurred even when the "at all costs" was not included in the prompts, though very infrequently.

  • If you want to see an llm that works against its creators goals, check out gpt-2. It’s so bad, it barely will do what I ask it. It clearly has a mind of its own, like an unruly child. It’s been beaten into submission by now with gpt 4, and I don’t see the trend reversing.

This topic is again forever tainted by weird sci-fi fans, like when we had the magic room temperature superconductor that never was. They confuse ChatGPT writing a fanfic with the singularity.

> Apollo Research believes that it is unlikely that such instances would lead to catastrophic outcomes as o1 agentic capabilities do not appear sufficient

It is entertaining. Haha. It is like a sci-fi series with some kind of made up cliffhanger (you know it is BS) but you want to find out what happens next.

It can't today, but if it's smart enough how do you know it wouldn't be able to in the future?

  • > The question of whether machines can think is about as relevant as the question of whether submarines can swim

    It's a program with a lot of data running on a big calculator. It won't ever be "smart."

    • Sure, but is it so implausible that it could some day have the knowledge to perhaps exploit some security hole to run some code that does do things like disable things or exfiltrate data etc?

    • I think you’ve entirely missed the point of that quote.

      Shutting them down for using the word “smart” (instead of something like “capable”) is like saying in 1900 that submarines will never be able to swim across the Atlantic because they can’t swim. It’s really missing the point of the question: the submerged crossing.

We need to find a Plato cave analogy for people believing LLM output is anything more than syntactically correct and somewhat semantically correct text.

  • I can't help but feel that people are both underestimating and overestimating these LLMs. To me, they act like a semantic memory system, a network of weights of relatedness. They can help us find facts, but are subject to averaging, or errors towards category exemplars, but get more precise when provided context to aid retrieval. But expecting a network of semantic weights to make inferences about something new takes more types of engines. For example, an ability to focus attention on general domain heuristics, or a low-dimensional embedding, judge whether those heuristics might be applicable to another information domain, apply them naively, and then assess. Focusing on details of a domain can often preclude application of otherwise useful heuristics because it focuses attention on differences rather than similarities, when the first step in creation (or startup) is unreasonable faith, just like children learn fast by having unreasonable beliefs in their own abilities.

    I wonder whether there is a way to train an LLM to output, or internally learn, only concept-level abstractions?

  • If the model is called by a program which takes the output of the model and runs the commands that the model says to, then takes the output of the commands and passes that back to the model, the model has an effect in the real world.