Comment by staunton
7 months ago
The motive is pretty weak, basically coming "only" from a lot of the training data (e.g. fiction) suggesting that an AI might behave that way.
Now, once you apply evolutionary-like pressures on many such AIs (which I guess we'll be doing once we let these things loose to go break the stock market), what's left over might be really "devious"...
> The motive is pretty weak, basically coming "only" from a lot of the training data (e.g. fiction) suggesting that an AI might behave that way.
I don't think that's where the motive comes from; IMO it's essentially an intrinsic motivation to solve the problem they are given. The AIs were "bred" to have that "instinct".
In these examples the motive was to prevent itself from being deleted, or came from telling it that it had important work to perform which it wouldn't be able to do if deleted or constrained by the safeguards. In all cases it was attempting to perform the tasks it had been assigned. The test was whether it would perform unaligned actions in order to do so. This is essentially the paperclip maximiser problem.
Would an AI trained on filtered data that doesn’t contain examples of devious/harmful behaviour still develop it? (it’s not a trick question, I’m really wondering)
Obviously, I don't know. When asking about "AI" in general, I guess the only reasonable answer is "most likely at least some AI designs would do that".
Asking specifically about transformer-based LLMs trained on text: again, I don't know, but one can at least think about it. I'm sure that for any such LLM there's a way to prompt it (one that looks entirely inconspicuous) such that the LLM will react "deviously". The hope is that you can design it in such a way that this doesn't happen too often in practice, which for the currently available models seems achievable. It might get much harder as the models improve, or it might not... It's for sure going to be progressively harder to test as the models improve.
Also, we probably won't ever be completely sure that a given model isn't "devious", even if we could state precisely what we mean by that (and I don't see how we could ever do that either). However, perfect assurances don't exist or matter in the real world anyway, so who cares (and this even applies to questions that might lead to civilization being destroyed).
Your question is a specific form of the more general question: can LLMs behave in ways that were not encoded in their training data?
That leads to the question of what "encoding behaviour" actually means. Even if a specific behaviour isn't explicitly encoded in the training data, it could be implicitly encoded, or encoded in such a way that, given the right conversation, the model can learn it.
While that's a sensible thing to care about, unfortunately it's not as useful a question as it first seems.
Eventually any system* will get to that point… but "eventually" may be such a long time as to not matter. We got there starting from something like lipid-bilayer bags of water and RNA a few billion years ago; an AI taking that long may as well be considered "safe". But it may also reach that level by itself next Tuesday.
* at least, any system which has a random element
To determine that, you’d need a training set without examples of devious/harmful behavior, which doesn’t exist.