Comment by e1g

13 hours ago

Do you happen to have a link with a more nuanced technical analysis of that (emergent) behavior? I’ve read only the pop-news version of that “escaping” story.

There is none. We don't understand LLMs well enough to conduct a full fault analysis like that.

We can't trace the thoughts of an LLM the way we can trace code execution; the best mechanistic interpretability has to offer is the occasional glimpse. The reasoning traces help, but they're still incomplete.

Is it pattern-matching? Is it acting on its own internal goals? Is it acting out fictional tropes? Were the circumstances of the test scenarios intentionally designed to be extreme? Would this behavior have happened in a real-world deployment, under the right circumstances?

The answer is "yes" to all of the above. LLMs are like that.