Comment by mmmore
7 months ago
1. The model was prompted to follow the original system prompt "at all costs", not explicitly to plot against the prompter.
2. The point of the test is to determine whether the model has the capability to act against its users and hide those actions, not whether it will under normal circumstances. Some models aren't powerful enough to do so.
3. The behavior occurred even when "at all costs" was not included in the prompt, though very infrequently.