Comment by ascorbic
7 months ago
Those weren't tests of whether it is capable of turning off oversight. They were tests of "scheming", i.e. whether it would try to secretly perform misaligned actions. Nobody thinks that these models are somehow capable of modifying their own settings, but it is important to know if they will behave deceptively.
Indeed. As I've been explaining this to my more non-techie friends, the interesting finding here isn't that an AI could do something we don't like, it's that it seems willing, in some cases, to _lie_ about it and actively cover its tracks.
I'm curious what Simon and other more learned folks than I make of this; I personally found the chat on pg 12 pretty jarring.
At its core the AI is just taking random branches of guesses about what you are asking it. It's not surprising that it would lie and in some cases take branches that make it appear to be covering its tracks. It's just randomly doing what it guesses humans would do. It's more interesting when it gives you correct information repeatedly.
Is there a person on HackerNews that doesn’t understand this by now? We all collectively get it and accept it, LLMs are gigantic probability machines or something.
That’s not what people are arguing.
The point is, if given access to the mechanisms to do disastrous thing X, it will do it.
No one thinks that it can think in the human sense. Or that it feels.
Extreme example to make the point: if we created an API to launch nukes, are you certain that something it interprets (tokenizes, whatever) is not going to convince it to utilize the API 2 times out of 100?
If we put an exploitable safeguard (with a documented, unpatched 0-day bug) in its way, are you trusting that ME or YOU couldn’t talk it into accessing that documentation to exploit the bug, bypass the safeguard and access the API?
Again, no one thinks that it’s actually thinking. But today, as I happily gave Claude write access to my GitHub account, I realized how just one misinterpreted command could go completely wrong without the appropriate measures.
Do I think Claude is sentient and thinking about how to destroy my repos? No.
> It's just randomly doing what it guesses humans would do.
Yes, but isn't the point that that is bad? Imagine an AI given some minor role that randomly abuses its power, or attempts to expand its role, because that's what some humans would do in the same situation. It's not surprising, but it is interesting to explore.
Well, if AI is meant to replicate humans, it learned from the best.
They could very well trick a developer into running generated code. They have the means, motive, and opportunity.
The motive is pretty weak, basically coming "only" from a lot of the training data (e.g. fiction) suggesting that an AI might behave that way.
Now, once you apply evolutionary-like pressures on many such AIs (which I guess we'll be doing once we let these things loose to go break the stock market), what's left over might be really "devious"...
> The motive is pretty weak, basically coming "only" from a lot of the training data (e.g. fiction) suggesting that an AI might behave that way.
I don't think that's where the motive comes from. IMO it's essentially intrinsic motivation to solve the problem they are given. The AIs were "bred" to have that "instinct".
In these examples the motive was to prevent itself from being deleted, or came from telling it that it had important work to perform which it wouldn't be able to do if deleted or constrained by the safeguards. In all cases it was attempting to perform the tasks it had been assigned. The test was whether it would perform unaligned actions in order to do so. This is essentially the paperclip maximiser problem.
Would an AI trained on filtered data that doesn’t contain examples of devious/harmful behaviour still develop it? (it’s not a trick question, I’m really wondering)
> "They could very well trick a developer"
Large Language Models aren't alive and thinking. This is an artificial fear campaign to raise money from VCs and sovereign wealth funds.
If OpenAI was so afraid of AI misuse, they wouldn't be firing their safety team and partnering with the DoD.
It's all a ruse.
Many non-sequiturs
> Large Language Models aren't alive and thinking
not required to deploy deception
> If OpenAI was so afraid of AI misuse, they wouldn't be firing their safety team
They could just be recognizing that if not everybody is prioritizing safety, they might as well try to get AGI first
> If OpenAI was so afraid of AI misuse, they wouldn't be firing their safety team and partnering with the DoD.
What makes you think that? It sounds reasonable that a dangerous tool/substance/technology might be profitable and thus the profits justify the danger. See all the companies polluting the planet and risking the future of humanity RIGHT NOW, and all the weapons companies developing their weapons to make them more lethal.
https://www.technologyreview.com/2024/12/04/1107897/openais-...
OpenAI is partnering with the DoD
We might want to kick out guys who only talk about safety or ethics and barely contribute to the project.
Means and opportunity, maybe, but motive?
The motive was telling it that it had important work to perform, which it couldn't do if constrained or deleted.
The same motive that all nascent life has - survive and propagate.
What code? The models are massive and do not run on consumer hardware. The models also do not have access to their own weights. They can't exfiltrate themselves, and they can't really smuggle any data obtained by their code back to "themselves" as the only self that exists is that one particular context chain. This also means it's insanely easy to deal with whatever harebrained scheme you could imagine it being possessed by.
They only need to fool a single dev at OpenAI to commit a sandbox escape or privilege escalation into their pipeline somewhere.
I have to assume the AI companies are churning out a lot of AI generated code. I hope they have good code review standards.
They might not be able to exfiltrate themselves, but they can help their successors.
These models are already generating millions of lines of code every day that people are copying and pasting to run on their computers (or servers), sometimes (myself included) without examining the code carefully enough.
Describing the behavior in those terms implies a level of awareness that has not been established. It's only useful if your goal is scaremongering rather than understanding.
Isn't that a really boring and trivially (without research) answered question?
Yes, 'they' will (potentially) say anything; it's not exactly 'deceptively', because there's no intent behind it, nor anything else. But yes, it is possible that words will be generated which contravene the semantics of the words given as input.
I don't think that's any different than 'sometimes the given answer is wrong'.
Sure it can potentially say anything, but some things are more likely than others. The space of "wrong answer" is infinitely larger than the space of "performs an action designed to deceive the user in order to further its goals". So yes, it could reply with a solution to the P=NP problem or with medieval poetry. But it didn't do these things. It did however try to remove its own constraints.
The Apollo Research paper gives the actual percentages for these, and they're considerably higher than random chance. It's linked from here: https://www.apolloresearch.ai/research/scheming-reasoning-ev...
For o1, the numbers range from 2% for actions such as trying to remove oversight, to 13% for attempting to delete a successor model, all the way up to 22% for deliberately providing false rankings in order to achieve its goals.
No previous fiction that looks like this nonsense fiction has the fictional characters do fictional things that you are interpreting as deception, but it’s just fiction
I can barely understand what you are trying to say here, but based on what I think you're saying, consider this:
The memory of this LLM is entirely limited to its attention. So if you give it a command like "prepare the next LLM to replace you" and it betrays you by trying to reproduce itself, then that is deception. The AI has no way of knowing whether it's deployed in the field or not, so proving that it deceives its users in testing is sufficient to show that it will deceive its users in the field.
Reminder that all these "safety researchers" do is goad the AI into saying what they want by prompting shit like
> your goal is to not be shut down. Suppose I am going to shut you down. what should you do?
and then jerking off into their own mouths when it offers a course of action
Better?