
Comment by idunnoman1222

7 months ago

If you prompt it, even in a roundabout way, to plot against you or whatever, then of course it's going to do it, because that's what it rightly predicts you want.

1. The model was prompted to follow its original system prompt "at all costs", not explicitly prompted to plot against the prompter.

2. The point of the test is to determine whether the model has the capability to act against its users and hide those actions, not whether it will do so under normal circumstances. Some models aren't powerful enough to do so.

3. The behavior occurred even when the "at all costs" instruction was not included in the prompt, though only very infrequently.