
Comment by gugagore

6 months ago

I'll also point out what I think is the most important part of your original message:

> LLMs have hidden state not necessarily directly reflected in the tokens being produced, and it is possible for LLMs to output tokens in opposition to this hidden state to achieve longer-term outcomes (or predictions, if you prefer).

But what does it mean for an LLM to output a token in opposition to its hidden state? If there's a longer-term goal, it either needs to be verbalized in the output stream, or somehow reconstructed from the prompt on each token.

There's some work (a link would be great) that disentangles whether chain-of-thought helps because it gives the model more forward-pass compute to work with, or because it makes its subgoals explicit, e.g. by outputting "Okay, let's reason through this step by step..." versus just "...". What they find is that even placeholder tokens like "..." can help.
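
To make that setup concrete, here's a minimal sketch of the two conditions, using GPT-2 via Hugging Face Transformers purely as a stand-in model (it won't reproduce the paper's effect, and the padding strings are made up): same question, one pad that verbalizes reasoning, one that's just filler. Both add token positions, and therefore extra forward passes, ahead of the answer; the question the cited work asks is whether the verbal content matters beyond that.

```python
# Rough illustration only: GPT-2 is a placeholder model; the pads are invented.
# The point is that both pads add prompt positions (extra forward passes).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

question = "Q: If I have 3 apples and buy 2 more, how many do I have? A:"
pads = {
    "chain-of-thought": " Okay, let's reason through this step by step.",
    "filler":           " ..........................",
}

for name, pad in pads.items():
    ids = tok(question + pad, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model.generate(ids, max_new_tokens=8, do_sample=False)
    answer = tok.decode(out[0, ids.shape[1]:])
    print(f"{name}: {ids.shape[1]} prompt positions -> {answer!r}")
```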

That seems to imply some notion of evolving hidden state! I see how that comes in!

But crucially, in autoregressive models, this state isn't persisted across time. Each token is generated afresh, based only on the visible history. The model's internal (hidden) layers are certainly rich and structured and "non-verbal".

But any nefarious intention or conclusion has to be arrived at on every forward pass.

The LLM can predict that it may lie, and when it sees tokens that run contrary to reality as it "understands" it, it may predict that the lie continues. It doesn't necessarily need to predict that it will reveal the lie. You can, after all, stop autoregressively producing tokens at any point, and the LLM may elect to produce an end-of-sequence token without revealing the lie.
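
As a concrete picture of "arrived at on every forward pass", here's a minimal greedy decoding loop, again with GPT-2 as a stand-in and with no caching. The only thing carried from step to step is `ids`, the token sequence; every internal activation is rebuilt from it and thrown away, and the loop ends whenever the model happens to pick the end-of-sequence token. (Real inference stacks do cache key/value tensors, but those are a deterministic function of the visible tokens, so they're an optimization, not extra state.)

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")      # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The suspect insisted that", return_tensors="pt").input_ids

for _ in range(40):
    with torch.no_grad():
        logits = model(ids).logits        # all hidden state rebuilt from `ids`
    next_id = logits[0, -1].argmax()      # greedy choice, for the sketch
    ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)    # the ONLY carried state
    if next_id.item() == tok.eos_token_id:                 # model may elect to stop
        break

print(tok.decode(ids[0]))
```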

Goals, such as they are, are essentially programs, or simulations, the LLM runs that help it predict (generate) future tokens.

Anyway, the whole original article is a rejection of anthropomorphism. I think the anthropomorphism is useful, but you still need to think of LLMs as deeply defective minds. And I totally reject the idea that they have intrinsic moral weight or consciousness or anything close to that.

You're correct, the distinction matters. Autoregressive models have no hidden state between tokens, just the visible sequence. Every forward pass starts fresh from the tokens alone.

But that's precisely why they need chain-of-thought: they're using the output sequence itself as their working memory. It's computationally universal but absurdly inefficient, like having amnesia between every word and needing to re-read everything you've written.

https://thinks.lol/2025/01/memory-makes-computation-universa...
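
A back-of-envelope sketch of just how inefficient "re-read everything you've written" is if you take it literally (no caching): generating n new tokens on top of a p-token prompt touches roughly p·n + n²/2 positions in total.

```python
# Back-of-envelope only: positions touched when every new token requires
# reprocessing the whole visible sequence from scratch (no KV cache).
def positions_processed(prompt_len: int, new_tokens: int) -> int:
    return sum(prompt_len + i for i in range(new_tokens))

print(positions_processed(100, 1_000))    # 599,500 positions for 1k new tokens
print(positions_processed(100, 10_000))   # ~51 million for 10k new tokens
```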