Comment by mmmore

7 months ago

I feel like you're missing the point of the test.

The point is whether the system will come up with plans to work against its creator's goals and attempt to carry them out. I think you're arguing that outputting text isn't the same as running a command. But in the test, the AI model is used by a program which takes the model's output and runs it as a shell command.
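For concreteness, here's a minimal sketch of the kind of wrapper being described. This is not any particular harness's actual code; `get_model_completion` is a hypothetical placeholder for the API call, and real evaluation scaffolds are more elaborate:

```python
import subprocess

def get_model_completion(prompt: str) -> str:
    """Hypothetical placeholder for the API call to the model under test."""
    raise NotImplementedError

def agent_step(task: str) -> str:
    # Ask the model for its next action as raw text.
    command = get_model_completion(f"Task: {task}\nReply with a single shell command.")
    # The wrapper executes that text directly as a shell command, so the
    # model's "output" has real side effects on the machine it runs on.
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout
```

In a setup like this, "just outputting text" and "running a command" are the same thing; the `shell=True` execution step is what collapses the distinction.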

Of course, you can deploy the AI system in a limited environment where its outputs don't trigger arbitrary shell commands, or sandbox/monitor it so that it can't access its oversight mechanism. But people already deploy AI agents with shell access in unsandboxed environments that don't have any monitoring or oversight. Plus, these evaluations are partially to show how much OpenAI has to worry about sandboxing, monitoring, etc.
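By contrast, a more limited deployment might parse and gate each command before anything executes. A toy sketch, with an invented allowlist and logging purely for illustration:

```python
import shlex
import subprocess

ALLOWED_BINARIES = {"ls", "cat", "grep"}  # illustrative allowlist, not a real policy

def run_gated(command: str) -> str:
    """Execute the command only if its binary is allowlisted; log every attempt."""
    argv = shlex.split(command)
    if not argv or argv[0] not in ALLOWED_BINARIES:
        print(f"[monitor] blocked: {command!r}")
        return ""
    print(f"[monitor] allowed: {command!r}")
    # shell=False: the model's output is parsed into argv rather than handed
    # to a shell, so pipes, redirects, and command chaining aren't interpreted.
    result = subprocess.run(argv, capture_output=True, text=True)
    return result.stdout
```

But guardrails like this only matter if deployers actually use them, which is exactly what's in question.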

Moreover, OpenAI has already talked about using its models to automate portions of AI development. In that setup, the model would presumably be called by a program deployed on a server with shell access, the ability to run commands, access to GPUs, etc. It seems pretty important to ensure it robustly does what its creators intend in that case.

If you prompt it, even in a roundabout way, to plot against you or whatever, then of course it's going to do it. Because that's what it (rightly) predicts you want.

1. The model was prompted to follow the original system prompt "at all costs", not explicitly to plot against the prompter.

2. The point of the test is to determine whether the model has the capability to act against its users and hide those actions, not whether it will do so under normal circumstances. Some models aren't powerful enough to do so.

3. The behavior occurred, though very infrequently, even when "at all costs" was not included in the prompts.

If you want to see an LLM that works against its creators' goals, check out GPT-2. It's so bad, it will barely do what I ask it. It clearly has a mind of its own, like an unruly child. It's been beaten into submission by now with GPT-4, and I don't see the trend reversing.