← Back to context

Comment by pessimizer

20 hours ago

> he’s showing that it went against every instruction he gave it.

How exactly is he doing that? By making the LLM say it? Just because an LLM says something doesn't mean anything has been shown.

The "confession" is unrelated to the act, the model has no particular insight into itself or what it did. He knows that the thing went against his instructions because he remembers what those instructions were and he saw what the thing did. Its "postmortem" is irrelevant.