
Comment by drdeca


While restricting these language models from providing harmful information that people already know is probably not particularly helpful, I do think that having the technical ability to make them decline to do so could be beneficial and important in the future.

If, in the future, such models (or their successors) are able to plan actions better than people can, it would probably be good to prevent them from making and providing plans for some harmful end that are more effective at achieving that end than anything a human could come up with.

Now, maybe they will never be capable of better planning in that way.

But if they eventually are, it seems better to know ahead of time how to make sure they don’t make and provide such plans?

Whether the current practice of trying to make sure they don’t provide certain kinds of information actually helps with that goal of “knowing ahead of time how to make sure they don’t make and provide such plans” (under the assumption that some future models will be capable of superhuman planning) is a question I don’t have a confident answer to.

Still, for the time being, perhaps the best response after finding a truly jailbreakproof method is to thoroughly verify that it is jailbreakproof, then stop using it and let people get whatever answers they want, until closer to when it actually becomes necessary (i.e., when those greater planning capabilities are approaching).