Comment by ACCount37
8 hours ago
There is none. We don't understand LLMs well enough to be able to conduct a full fault analysis like this.
We can't trace an LLM's thoughts the way we can trace code execution - the best mechanistic interpretability can offer is the occasional glimpse. Reasoning traces help, but they're still incomplete.
Is it pattern-matching? Is it acting on its own internal goals? Is it acting out fictional tropes? Were the test scenarios intentionally designed to be extreme? Would this behavior have happened in a real-world deployment, under the right circumstances?
The answer is "yes" to all of the above. LLMs are like that.