Comment by zozbot234
17 hours ago
It's not really a capability; it's more like a very costly hack, and they make that very clear in the paper. Training two models (an encoder and a decoder) to explain a single layer at a time is not that sensible. It's neat that you can generate so much readable text about how the LLM decodes partial input, and I suppose it gives you some extra debugging ability, but that's all there is to it.
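If it helps make the setup concrete, it's roughly: freeze the subject model, train a per-layer encoder that maps that layer's activations into soft tokens for a separate decoder LLM, and fine-tune the decoder to emit readable text conditioned on them. A minimal sketch (my own, not the paper's code; all names, dimensions, and the soft-token interface are assumptions):

    import torch
    import torch.nn as nn

    class ActivationEncoder(nn.Module):
        """Maps one layer's activations into an explainer LM's
        input-embedding space (one encoder per explained layer)."""
        def __init__(self, d_model: int, d_explainer: int, n_soft_tokens: int = 8):
            super().__init__()
            self.n_soft_tokens = n_soft_tokens
            self.d_explainer = d_explainer
            self.proj = nn.Linear(d_model, d_explainer * n_soft_tokens)

        def forward(self, h: torch.Tensor) -> torch.Tensor:
            # h: (batch, d_model) activations from the layer being explained
            soft = self.proj(h)
            # -> (batch, n_soft_tokens, d_explainer); these soft tokens are
            # prepended to the explainer's prompt embeddings, and the explainer
            # is trained with ordinary next-token loss on explanations.
            return soft.view(-1, self.n_soft_tokens, self.d_explainer)

    enc = ActivationEncoder(d_model=4096, d_explainer=2048)
    soft_tokens = enc(torch.randn(1, 4096))  # (1, 8, 2048)

The "costly" part is that you repeat this training for every layer you want explained.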
Why does its being a “costly hack” make it “not a capability”?
By that logic, LLMs themselves, which are fairly described as both “costly” and “a hack,” would not constitute a useful capability, which I hope most people would agree is obviously false.
The NLA also hallucinates, so it's still not revealing the model's actual "thoughts"; the paper points out that since the NLA is a full LLM, it can make inferences that aren't actually present in the activations.
But it's a useful approximation for auditing.
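One cheap audit-time sanity check (again my own sketch, not from the paper): generate explanations for the real activation and for norm-matched random activations, and see whether they actually differ. Here `encoder` and `explain_fn(soft_tokens) -> str` are assumed interfaces matching the sketch above:

    import torch

    def grounding_gap(h, encoder, explain_fn, trials: int = 4) -> float:
        """Average token-overlap between the explanation of the real
        activation and explanations of norm-matched noise. A value near
        1.0 suggests the explainer is confabulating from its own priors
        rather than reading the activations."""
        def jaccard(a: str, b: str) -> float:
            sa, sb = set(a.split()), set(b.split())
            return len(sa & sb) / max(len(sa | sb), 1)

        real = explain_fn(encoder(h))
        sims = []
        for _ in range(trials):
            noise = torch.randn_like(h)
            noise = noise * (h.norm() / noise.norm())  # match the real norm
            sims.append(jaccard(real, explain_fn(encoder(noise))))
        return sum(sims) / trials

It's crude (token overlap is a weak similarity proxy), but it's the kind of check that makes "useful approximation" more defensible.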