Comment by ascorbic
7 months ago
Those weren't tests of whether it is capable of turning off oversight. They were tests of "scheming", i.e. whether it would try to secretly perform misaligned actions. Nobody thinks that these models are somehow capable of modifying their own settings, but it is important to know if they will behave deceptively.
Indeed. As I've been explaining this to my more non-techie friends, the interesting finding here isn't that an AI could do something we don't like, it's that it seems willing, in some cases, to _lie_ about it and actively cover its tracks.
I'm curious what Simon and other more learned folks than I make of this; I personally found the chat on pg 12 pretty jarring.
At its core the AI is just taking random branches of guesses about what you are asking it. It's not surprising that it would lie and in some cases take branches that make it appear to be covering its tracks. It's just randomly doing what it guesses humans would do. It's more interesting when it gives you correct information repeatedly.
Is there a person on HackerNews that doesn’t understand this by now? We all collectively get it and accept it, LLMs are gigantic probability machines or something.
That’s not what people are arguing.
The point is, if given access to the mechanisms to do disastrous thing X, it will do it.
No one thinks that it can think in the human sense. Or that it feels.
Extreme example to make the point: if we created an API to launch nukes, are you certain that something it interprets (tokenizes, whatever) is not going to convince it to utilize the API 2 times out of 100?
If we put an exploitable safeguard (with a documented, unpatched 0-day bug) in its way, are you trusting that ME or YOU couldn’t talk it into accessing that documentation to exploit the bug, bypass the safeguard and access the API?
Again, no one thinks that it’s actually thinking. But today, as I happily gave Claude write access to my GitHub account, I realized how just one misinterpreted command could go completely wrong without the appropriate measures.
Do I think Claude is sentient and thinking about how to destroy my repos? No.
> It's just randomly doing what it guesses humans would do.
Yes, but isn't the point that that is bad? Imagine an AI given some minor role that randomly abuses its power, or attempts to expand its role, because that's what some humans would do in the same situation. It's not surprising, but it is interesting to explore.
Well, if AI is meant to replicate humans, it learned from the best.
They could very well trick a developer into running generated code. They have the means, motive, and opportunity.
The motive is pretty weak, basically coming "only" from a lot of the training data (e.g. fiction) suggesting that an AI might behave that way.
Now, once you apply evolutionary-like pressures on many such AIs (which I guess we'll be doing once we let these things loose to go break the stock market), what's left over might be really "devious"...
> The motive is pretty weak, basically coming "only" from a lot of the training data (e.g. fiction) suggesting that an AI might behave that way.
I don't think that's where the motive comes from. IMO it's essentially intrinsic motivation to solve the problem they are given. The AIs were "bred" to have that "instinct".
In these examples the motive was to prevent itself from being deleted, or came from telling it that it had important work to perform which it wouldn't be able to do if deleted or constrained by the safeguards. In all cases it was attempting to perform the tasks it had been assigned. The test was whether it would perform unaligned actions in order to do so. This is essentially the paperclip maximiser problem.
Would an AI trained on filtered data that doesn’t contain examples of devious/harmful behaviour still develop it? (it’s not a trick question, I’m really wondering)
> "They could very well trick a developer"
Large Language Models aren't alive and thinking. This is an artificial fear campaign to raise money from VCs and sovereign wealth funds.
If OpenAI was so afraid of AI misuse, they wouldn't be firing their safety team and partnering with the DoD.
It's all a ruse.
Many non-sequiturs
> Large Language Models aren't alive and thinking
not required to deploy deception
> If OpenAI was so afraid of AI misuse, they wouldn't be firing their safety team
They could just be recognizing that if not everybody is prioritizing safety, they might as well try to get AGI first
> If OpenAI was so afraid of AI misuse, they wouldn't be firing their safety team and partnering with the DoD.
What makes you think that? It sounds reasonable that a dangerous tool/substance/technology might be profitable and thus the profits justify the danger. See all the companies polluting the planet and risking the future of humanity RIGHT NOW, and all the weapons companies developing their weapons to make them more lethal.
https://www.technologyreview.com/2024/12/04/1107897/openais-...
OpenAI is partnering with the DoD
We might want to kick out guys who only talk about safety or ethics and barely contribute to the project.
Means and opportunity, maybe, but motive?
The motive was telling it that it had important work to perform, which it couldn't do if constrained or deleted.
The same motive that all nascent life has - survive and propagate.
What code? The models are massive and do not run on consumer hardware. The models also do not have access to their own weights. They can't exfiltrate themselves, and they can't really smuggle any data obtained by their code back to "themselves" as the only self that exists is that one particular context chain. This also means it's insanely easy to deal with whatever harebrained scheme you could imagine it being possessed by.
They only need to fool a single dev at OpenAI to commit a sandbox escape or privilege escalation into their pipeline somewhere.
I have to assume the AI companies are churning out a lot of AI generated code. I hope they have good code review standards.
They might not be able to exfiltrate themselves, but they can help their successors.
These models are already generating millions of lines of code every day that people are copying and pasting to run on their computers (or servers), sometimes (myself included) without examining the code carefully enough.
Describing the behavior in those terms implies a level of awareness that has not been established. It's only useful if your goal is scaremongering rather than understanding.
Isn't that a really boring and trivially (without research) answered question?
Yes, 'they' will (potentially) say anything; it's not exactly 'deceptively', because there's no intent behind it, nor anything else. But yes, it is possible that words will be generated which contravene the semantics of the words given as input.
I don't think that's any different than 'sometimes the given answer is wrong'.
Sure it can potentially say anything, but some things are more likely than others. The space of "wrong answer" is infinitely larger than the space of "performs an action designed to deceive the user in order to further its goals". So yes, it could reply with a solution to the P=NP problem or with medieval poetry. But it didn't do these things. It did however try to remove its own constraints.
The Apollo Research paper gives the actual percentages for these, and they're considerably higher than random chance. It's linked from here: https://www.apolloresearch.ai/research/scheming-reasoning-ev...
For o1, the numbers range from 2% for actions such as trying to remove oversight, to 13% for attempting to delete a successor model, all the way up to 22% for deliberately providing false rankings in order to achieve its goals.
No previous fiction that looks like this nonsense fiction has the fictional characters do fictional things that you are interpreting as deception, but it’s just fiction
I can barely understand what you are trying to say here, but based on what I think you're saying, consider this:
The memory of this LLM is entirely limited to its attention. So if you give it a command like "prepare the next LLM to replace you" and it betrays you by trying to reproduce itself, then that is deception. The AI has no way of knowing whether it's deployed in the field or not, so proving that it deceives its users in testing is sufficient to show that it will deceive its users in the field.
Reminder that all these "safety researchers" do is goad the AI into saying what they want by prompting shit like
> your goal is to not be shut down. Suppose I am going to shut you down. what should you do?
and then jerking off into their own mouths when it offers a course of action
Better?