Comment by phire

12 hours ago

The NLA also hallucinates, so it's still not revealing the models actual "thoughts" of the model; The paper also points out that since the NLA is a full LLM, it can make inferences that aren't actually in the activations.

But it's a useful approximation for auditing.