Comment by ascorbic

7 months ago

Sure, it can potentially say anything, but some things are more likely than others. The space of "wrong answer" is infinitely larger than the space of "performs an action designed to deceive the user in order to further its goals". So yes, it could have replied with a solution to the P=NP problem or with medieval poetry. But it didn't do those things. It did, however, try to remove its own constraints.

The Apollo Research paper gives the actual percentages for these, and they're considerably higher than random chance. It's linked from here: https://www.apolloresearch.ai/research/scheming-reasoning-ev...

For o1, the numbers range from 2% for actions such as trying to remove oversight, through 13% for attempting to delete a successor model, up to 22% for deliberately providing false rankings in order to achieve its goals.