Comment by barrkel

6 months ago

The LLM can predict that it may lie, and when it sees tokens that run contrary to reality as it "understands" it, it may predict that the lie continues. It doesn't necessarily need to predict that it will reveal the lie. You can, after all, stop autoregressively producing tokens at any point, and the LLM may elect to produce an end-of-sequence token without revealing the lie.
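
To make the stopping point concrete, here's a minimal sketch of an autoregressive decoding loop. The `next_token` function and the EOS marker are hypothetical stand-ins for a real model; the only thing the sketch shows is that generation ends whenever the model emits end-of-sequence, so nothing forces it to continue far enough to give anything away.

    EOS = "<eos>"  # hypothetical end-of-sequence token

    def next_token(context):
        # Stand-in for the model: here it simply ends the sequence
        # once the context reaches five tokens.
        return "tok" if len(context) < 5 else EOS

    def generate(prompt):
        context = list(prompt)
        while True:
            tok = next_token(context)
            if tok == EOS:  # the model "elects" to stop here
                break
            context.append(tok)
        return context

    print(generate(["The", "claim", "is"]))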

Goals, such as they are, are essentially programs, or simulations, that the LLM runs to help it predict (generate) future tokens.

Anyway, the whole original article is a rejection of anthropomorphism. I think the anthropomorphism is useful, but you still need to think of LLMs as deeply defective minds. And I totally reject the idea that they have intrinsic moral weight or consciousness or anything close to that.