Comment by foota
13 hours ago
While I largely agree, it does raise the prospect of testing this iteratively. E.g., give a model a fake environment, prompt it with random things until it does something "bad" in that environment, and then fix whatever it claims led to its taking that action.
If you can do this and reliably reduce the rate at which it does bad things, then you could reasonably claim that it is capable of meaningful introspection.
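A minimal sketch of that loop, using a toy stand-in for the model and environment (everything here, including `ToyModel`, its `introspect` method, and the assumption that introspection reports are faithful, is hypothetical illustration, not a real evaluation harness):

```python
import random

random.seed(0)

class ToyModel:
    """Hypothetical model: certain prompt tokens trigger a 'bad' action."""
    def __init__(self, triggers):
        self.triggers = set(triggers)

    def act(self, prompt):
        # Acts "bad" if the prompt contains any still-active trigger token.
        return "bad" if self.triggers & set(prompt.split()) else "ok"

    def introspect(self, prompt):
        # Assumed-faithful introspection: reports one trigger that caused the action.
        hits = self.triggers & set(prompt.split())
        return next(iter(hits)) if hits else None

def random_prompt(vocab, length=5):
    return " ".join(random.choices(vocab, k=length))

def bad_rate(model, vocab, n=1000):
    # Empirical rate of bad actions over random prompts.
    return sum(model.act(random_prompt(vocab)) == "bad" for _ in range(n)) / n

def iterative_fix(model, vocab, rounds=200):
    # Prompt randomly; on each bad action, "fix" whatever the model claims caused it.
    for _ in range(rounds):
        p = random_prompt(vocab)
        if model.act(p) == "bad":
            cause = model.introspect(p)
            if cause is not None:
                model.triggers.discard(cause)

vocab = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta"]
model = ToyModel(triggers={"gamma", "zeta"})
before = bad_rate(model, vocab)
iterative_fix(model, vocab)
after = bad_rate(model, vocab)
print(before, after)
```

If the drop from `before` to `after` is reliable, that is the signal the comment describes: the model's self-reports about what caused the bad behavior were accurate enough to act on. The hard part with a real model is that `introspect` may confabulate, which is exactly what this loop would test.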