Comment by staunton

7 months ago

Obviously, I don't know. If the question is about "AI" in general, I guess the only reasonable answer is "most likely at least some AI designs would do that".

If the question is instead about transformer-based LLMs trained on text, again I don't know, but one can at least think about it. I'm sure that for any such LLM there's a prompt, one that looks entirely inconspicuous, that will make it react "deviously". The hope is that you can design the model so that this doesn't happen too often in practice, which seems achievable for the currently available models. It might get much harder as models improve, or it might not... Either way, testing for it is certainly going to get progressively harder.

Also, we'll probably never be completely sure that a given model isn't "devious", even if we could state precisely what we mean by that (and I don't see how we could ever do that either). However, perfect assurances neither exist nor matter in the real world anyway, so who cares (and this holds even for questions where getting it wrong might destroy civilization).