Comment by emtel (4 days ago)

This just isn’t true - one interesting paper on the topic: https://arxiv.org/abs/2212.03827

That paper doesn't contradict the parent. It's just pointing out that you can extract knowledge from the LLM with good accuracy by

"... finding a direction in activation space that satisfies logical consistency properties, such as that a statement and its negation have opposite truth values"

The LLM itself still has no idea of the truth or falsity of what it spits out. But this specific trick lets you retrieve yes/no answers to knowledge encoded in the model more accurately: it's a validation step you can impose that makes it less likely the yes/no answer is wrong.
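For concreteness, here's a rough sketch of the kind of probe the paper (Contrast-Consistent Search) trains - not the authors' code, just the idea: take the model's activations for a statement phrased as true and for its negation, and fit a linear direction whose probabilities are consistent (the pair should sum to ~1) and confident (not both 0.5). The data below is random placeholder just to show the objective.

```python
import torch
import torch.nn as nn

# Placeholder activations for N contrast pairs: the same statement phrased
# affirmatively (pos) and negated (neg), each a hidden-state vector of size d.
# In practice these would come from an LLM's internal activations.
N, d = 256, 768
pos_acts = torch.randn(N, d)
neg_acts = torch.randn(N, d)

# Linear probe: a single direction in activation space plus a bias.
probe = nn.Sequential(nn.Linear(d, 1), nn.Sigmoid())
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for _ in range(1000):
    p_pos = probe(pos_acts)
    p_neg = probe(neg_acts)
    # Consistency: a statement and its negation should get opposite truth values.
    consistency = ((p_pos - (1 - p_neg)) ** 2).mean()
    # Confidence: discourage the degenerate solution p_pos == p_neg == 0.5.
    confidence = (torch.min(p_pos, p_neg) ** 2).mean()
    loss = consistency + confidence
    opt.zero_grad()
    loss.backward()
    opt.step()

# At inference time, a yes/no answer can be read off by averaging the two views:
# answer_prob = 0.5 * (p_pos + (1 - p_neg))
```

Note the probe is trained without any truth labels - it only exploits the logical structure of the contrast pairs, which is why it works as an external check on what the model "knows" rather than evidence that the model itself tracks truth.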

Can you say a bit more? Just reading the abstract, it's not clear to me how this contradicts the parent comment.