Comment by zozbot234
17 hours ago
It's not really a capability; it's more like a very costly hack, and they make that very clear in the paper. Training two models (an encoder and a decoder) to explain a single layer at a time is not that sensible. It's neat that you can generate so much readable text about how the LLM decodes partial input, and I suppose it gives you some extra debugging ability, but that's all there is to it.
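If it helps make the setup concrete, it's roughly: freeze the subject model, train a per-layer encoder that maps that layer's activations into soft tokens for a separate decoder LLM, and fine-tune the decoder to emit readable text conditioned on them. A minimal sketch (my own, not the paper's code; all names, dimensions, and the soft-token interface are assumptions):

    import torch
    import torch.nn as nn

    class ActivationEncoder(nn.Module):
        """Maps one layer's activations into an explainer LM's
        input-embedding space (one encoder per explained layer)."""
        def __init__(self, d_model: int, d_explainer: int, n_soft_tokens: int = 8):
            super().__init__()
            self.n_soft_tokens = n_soft_tokens
            self.d_explainer = d_explainer
            self.proj = nn.Linear(d_model, d_explainer * n_soft_tokens)

        def forward(self, h: torch.Tensor) -> torch.Tensor:
            # h: (batch, d_model) activations from the layer being explained
            soft = self.proj(h)
            # -> (batch, n_soft_tokens, d_explainer); these soft tokens are
            # prepended to the explainer's prompt embeddings, and the explainer
            # is trained with ordinary next-token loss on explanations.
            return soft.view(-1, self.n_soft_tokens, self.d_explainer)

    enc = ActivationEncoder(d_model=4096, d_explainer=2048)
    soft_tokens = enc(torch.randn(1, 4096))  # (1, 8, 2048)

The "costly" part is that you repeat this training for every layer you want explained.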
Why does its being a “costly hack” make it “not a capability”?
By that logic, LLMs themselves, which are fairly described as both “costly” and “a hack,” would not constitute a useful capability, which I hope most people would agree is obviously false.
The NLA also hallucinates, so it's still not revealing the model's actual "thoughts"; the paper points out that since the NLA is a full LLM, it can make inferences that aren't actually present in the activations.
But it's a useful approximation for auditing.
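One cheap audit-time sanity check (again my own sketch, not from the paper): generate explanations for the real activation and for norm-matched random activations, and see whether they actually differ. Here `encoder` and `explain_fn(soft_tokens) -> str` are assumed interfaces matching the sketch above:

    import torch

    def grounding_gap(h, encoder, explain_fn, trials: int = 4) -> float:
        """Average token-overlap between the explanation of the real
        activation and explanations of norm-matched noise. A value near
        1.0 suggests the explainer is confabulating from its own priors
        rather than reading the activations."""
        def jaccard(a: str, b: str) -> float:
            sa, sb = set(a.split()), set(b.split())
            return len(sa & sb) / max(len(sa | sb), 1)

        real = explain_fn(encoder(h))
        sims = []
        for _ in range(trials):
            noise = torch.randn_like(h)
            noise = noise * (h.norm() / noise.norm())  # match the real norm
            sims.append(jaccard(real, explain_fn(encoder(noise))))
        return sum(sims) / trials

It's crude (token overlap is a weak similarity proxy), but it's the kind of check that makes "useful approximation" more defensible.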