Comment by ziofill

7 months ago

Would an AI trained on filtered data that doesn’t contain examples of devious/harmful behaviour still develop it? (it’s not a trick question, I’m really wondering)

Obviously, I don't know. When asking about "AI" in general, I guess the only reasonable answer is "most likely at least some AI designs would develop it".

Asking specifically about transformer-based LLMs trained on text, again I don't know, but one can at least reason about it. I'm sure that for any such LLM there's a way to prompt it (one that looks entirely inconspicuous) such that the LLM will react "deviously". The hope is that you can design the model so that this doesn't happen too often in practice, which for the currently available models seems achievable. It might get much harder as the models improve, or it might not... It will certainly get progressively harder to test.

Also, we probably won't ever be completely sure that a given model isn't "devious", even if we could state precisely what we mean by that, and I don't see how we could ever state it precisely either. However, perfect assurances don't exist or matter in the real world anyway, so who cares (and this applies even to questions that might lead to civilization being destroyed).

Your question is a specific form of the more general question: can LLMs behave in ways that were not encoded in their training data?

That leads to the question of what "encoding a behaviour" actually means. Even if a specific behaviour isn't explicitly encoded in the training data, it could be implicitly encoded, or encoded in such a way that, given the right conversation, the model can learn it.

While that's a sensible thing to care about, unfortunately it's not as useful a question as it first seems.

Eventually any system* will get to that point… but "eventually" may be such a long time as to not matter (we got there starting from something like lipid-bilayer bags of water and RNA a few billion years ago, so an AI taking that long may as well be considered "safe"), or it may reach that level by itself next Tuesday.

* at least, any system which has a random element

To determine that empirically, you'd need a training set without examples of devious/harmful behaviour, which doesn't exist.