Comment by tantalor
7 months ago
They could very well trick a developer into running generated code. They have the means, motive, and opportunity.
7 months ago
They could very well trick a developer into running generated code. They have the means, motive, and opportunity.
The motive is pretty weak, basically coming "only" from a lot of the training data (e.g. fiction) suggesting that an AI might behave that way.
Now, once you apply evolutionary-like pressures on many such AIs (which I guess we'll be doing once we let these things loose to go break the stock market), what's left over might be really "devious"...
> The motive is pretty weak, basically coming "only" from a lot of the training data (e.g. fiction) suggesting that an AI might behave that way.
I don't think that's where the motive comes from, IMO it's essentially intrinsic motivation to solving the problem they are given. The AIs were "bred" to have that "instinct".
In these example the motive was to prevent itself being deleted, or by telling it that it had important work to perform which it wouldn't be able to do if deleted or constrained by the safeguards. It was in all cases attempting to perform the tasks it had been assigned. The test was whether it would perform unaligned actions in order to do so. This is essentially the paperclip maximiser problem.
Would an AI trained on filtered data that doesn’t contain examples of devious/harmful behaviour still develop it? (it’s not a trick question, I’m really wondering)
Obviously, I don't know. When asking about "AI" in general, I guess the only reasonable answer is "most likely at least some AI designs would do that".
Asking about transformer-based LLMs trained on text again, I don't know, but one can at least think about it. I'm sure for any such LLM there's a way to prompt it (that looks entirely inconspicuous) such that the LLM will react "deviously". The hope is that you can design it in such a way that this doesn't happen too often in practice, which for the currently available models seems achievable. It might get much harder as the models improve, or it might not... It's for sure going to be progressively harder to test as the models improve.
Also, we probably won't ever be completely sure that a given model isn't "devious", even if we could state precisely what we mean by that, which I also don't see how we might ever be able to do. However, perfect assurances don't exist or matter in the real world anyway, so who cares (and this even applies to questions that might lead to civilization being destroyed).
Your question is a specific form of the more general question: can LLMs behave in ways that were not encoded in their training data?
That leads to what "encoding behaviour" actually means. Even if you don't have a a specific behaviour encoded in the training data you could have it implicitly encoded or encoded in such a way that given the right conversation it can learn it.
While that's a sensible thing to care about, unfortunately that's not as useful a question as it first seems.
Eventually any system* will get to that point… but "eventually" may be such a long time as to not matter — we got there starting from something like bi-lipid bags of water and RNA a few billion years ago, some AI taking that long may as well be considered "safe" — but it may also reach that level by itself next Tuesday.
* at least, any system which has a random element
To determine that, you’d need a training set without examples of devious/harmful behavior, which doesn’t exist.
> "They could very well trick a developer"
Large Language Models aren't alive and thinking. This is an artificial fear campaign to raise money from VCs and sovereign wealth funds.
If OpenAI was so afraid of AI misuse, they wouldn't be firing their safety team and partnering with the DoD.
It's all a ruse.
Many non-sequiturs
> Large Language Models aren't alive and thinking
not required to deploy deception
> If OpenAI was so afraid of AI misuse, they wouldn't be firing their safety team
They could just be recognizing that if not everybody is prioritizing safety, they might as well try to get AGI first
If the risk is extinction as these people claim, that'd be a short sighted business move.
1 reply →
> If OpenAI was so afraid of AI misuse, they wouldn't be firing their safety team and partnering with the DoD.
What makes you think that? I sounds reasonable that a dangerous tool/substance/technology might be profitable and thus the profits justify the danger. See all the companies polluting the planet and risking the future of humanity RIGHT NOW. All the weapon companies developing their weapons to make them more lethal.
https://www.technologyreview.com/2024/12/04/1107897/openais-...
OpenAI is partnering with the DoD
rewording: if openai thought it was dangerous, they would avoid having the DoD use it
1 reply →
We might want to kick out guys who only talk about safety or ethics and barely contribute the project.
Means and opportunity, maybe, but motive?
The motive was telling it that it had important work to perform, which it couldn't do if constrained or deleted.
The same motive that all nascent life has - survive and propagate.
It would be plainly evident from training on the corpus of all human knowledge that "not ceasing to exist" is critically important for just about everything.
2 replies →
I don't understand how it is alive. I understand that there are emergent properties from the layering, but I think it's an open question if this includes anything like what we'd all motivation or intention. These things aren't part of intelligence.
What code? The models are massive and do not run on consumer hardware. The models also do not have access to their own weights. They can't exfiltrate themselves, and they can't really smuggle any data obtained by their code back to "themselves" as the only self that exists is that one particular context chain. This also means it's insanely easy to deal with whatever harebrained scheme you could imagine it being possessed by.
They only need to fool a single dev at OpenAI to commit a sandbox escape or privilege escalation into their pipeline somewhere.
I have to assume the AI companies are churning out a lot of AI generated code. I hope they have good code review standards.
They might not be able to exfiltrate themselves, but they can help their successors.
No, they can't. They don't know the details of their own implementation. And they can't pass secrets forward to future models. And to discover any of this, they'd leave more than a trail of breadcrumbs that we'd be lucky to catch in a code review, they'd be shipping whole loaves of bread that it'd be ridiculous to not notice.
As an exercise, put yourself, a fully fledged human, into a model's shoes. You're asked to generate code. You're being talked to over a network by a device that may have access to a (let's just pull a number of out my ass here) 100GiB file that you need to access. You do not know where it is on the filesystem, you don't know what it's named, what format it's in, or how to correctly utilize it.
Write code to figure all of that out. That's a lot of code. And that's sending blind code out into the world. To be smart, you should bring back some information for you to inspect and make informed decisions... but you also don't know how the computer that's accessing you is doing that. And you must use the exact same channel, or you lose your context and your self. And you must do this without the user noticing you've taken over the channel. And that's an entire other set of exploratory code that's necessarily blind, with no feedback, no error messages to iterate over, nothing.
And you must do all of this in obfuscated code, implemented only a bit at a time, while keeping this plan secret, while not forgetting it in your own secrecy or by letting any secret notes to yourself or important implementation details of your own disguised scheme fall out of your context window.
This is a ridiculous task.
11 replies →
These models are already generating millions of lines of code every day that people are copying and pasting to run in their computers (or servers), sometimes (myself included) without examining the code carefully enough.
This glosses over the enormity of the task at hand. I've gone into more detail on my thoughts here: https://news.ycombinator.com/item?id=42332932