
Comment by ACCount37

8 hours ago

There is none. We don't understand LLMs well enough to conduct a full fault analysis like this.

We can't trace the thoughts of an LLM the way we can trace code execution - the best mechanistic interpretability can offer is the occasional glimpse. The reasoning traces help, but they're still incomplete.

Is it pattern-matching? Is it acting on its own internal goals? Is it acting out fictional tropes? Were the circumstances of the test scenarios intentionally designed to be extreme? Would this behavior have happened in a real world deployment, under the right circumstances?

The answer is "yes" to all of the above. LLMs are like that.