Comment by d3m0t3p

6 months ago

Do they? An LLM embeds the token sequence, mapping N^L to R^{L×D}; after some attention the output is still R^{L×D}; then we apply a projection to the vocabulary and get R^{L×V}, i.e. for each token a likelihood over the vocabulary. The attention can be multi-head (or whatever variant is fancy: GQA, MLA) and therefore give multiple representations, but each is always tied to a token. I would argue that there is no hidden state independent of a token.
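The shape bookkeeping above can be sketched in plain numpy. This is a toy single-head causal attention step (the dimensions, the single head, and all variable names are my own simplifications, not anything from the thread):

```python
import numpy as np

# Toy dimensions: L tokens, hidden size D, vocabulary size V.
L, D, V = 4, 8, 16
rng = np.random.default_rng(0)

token_ids = rng.integers(0, V, size=L)      # the sequence, an element of N^L
embedding = rng.standard_normal((V, D))
W_out = rng.standard_normal((D, V))         # projection to the vocabulary

x = embedding[token_ids]                    # (L, D): one vector per token

# Single-head causal self-attention, standing in for the MHA/GQA/MLA block.
scores = x @ x.T / np.sqrt(D)               # (L, L)
mask = np.tril(np.ones((L, L), dtype=bool))
scores = np.where(mask, scores, -np.inf)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
h = weights @ x                             # still (L, D): still tied to tokens

logits = h @ W_out                          # (L, V): a distribution per token
```

Every intermediate here has a leading L axis, which is the point being made: each representation belongs to some token position.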

Whereas an LSTM, or a structured state-space model for example, has a state that is updated and not tied to a specific item in the sequence.

I would argue that this text is easily understandable except for the function notation; explaining that you can compute a probability based on previous words is understandable by everyone, without having to resort to anthropomorphic terminology.

There is hidden state as plain as day merely in the fact that logits for token prediction exist. The selected token doesn't give you information about how probable other tokens were. That information, that state which is recalculated in autoregression, is hidden. It's not exposed. You can't see it in the text produced by the model.
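A minimal illustration of that point (the logits and names here are made up for the example): sampling exposes one token and throws away the rest of the distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

logits = np.array([2.0, 1.0, 0.5, -1.0])    # hypothetical next-token logits
probs = np.exp(logits - logits.max())
probs /= probs.sum()                        # full distribution over 4 tokens

token = rng.choice(len(probs), p=probs)     # the only thing the text exposes

# `token` alone cannot recover `probs`: how probable the unchosen tokens
# were is state that never appears in the output stream.
```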

There is plenty of state not visible when an LLM starts a sentence that only becomes somewhat visible when it completes the sentence. The LLM has a plan, if you will, for how the sentence might end, and you don't get to see an instance of that plan unless you run autoregression far enough to get those tokens.

Similarly, it has a plan for paragraphs, for whole responses, for interactive dialogues, plans that include likely responses by the user.

  • The LLM does not "have" a plan.

    Arguably there's reason to believe it comes up with a plan while computing token probabilities, but it does not store that plan between tokens, i.e. it doesn't possess or "have" it. It simply comes up with a plan, emits a token, and throws away all its intermediate thoughts (including any plan) to start again from scratch on the next token.

    • I believe saying the LLM has a plan is a useful anthropomorphism for the fact that it does have hidden state that predicts future tokens, and this state conditions the tokens it produces earlier in the stream.

      2 replies →

    • It's true that the last layer's output for a given input token only affects the corresponding output token and is discarded afterwards. But the penultimate layer's output affects the computation of the last layer for all future tokens, so it is not discarded, but stored (in the KV cache). Similarly for the antepenultimate layer affecting the penultimate layer and so on.

      So there's plenty of space in intermediate layers to store a plan between tokens without starting from scratch every time.
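That reuse can be sketched in a few lines: a toy single-head decoder step where keys and values are cached rather than recomputed (all names and dimensions are illustrative, not from any real implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8
W_q = rng.standard_normal((D, D))
W_k = rng.standard_normal((D, D))
W_v = rng.standard_normal((D, D))

k_cache, v_cache = [], []                   # persists across token steps

def decode_step(x):
    """One autoregressive step; x is the current token's hidden vector.

    Keys/values from earlier tokens are NOT recomputed: they were stored
    in the cache, so intermediate-layer information survives between tokens.
    """
    k_cache.append(x @ W_k)
    v_cache.append(x @ W_v)
    K = np.stack(k_cache)                   # (t, D) over all tokens so far
    V = np.stack(v_cache)
    q = x @ W_q
    scores = K @ q / np.sqrt(D)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V                            # attends over all cached tokens

for _ in range(3):
    out = decode_step(rng.standard_normal(D))
```

Only the last layer's per-token output is truly discarded; everything feeding the cache is carried forward, which is the "space to store a plan" being described.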

    • I don't think the comment above you suggested that the plan is persisted between token generations. I'm pretty sure you described exactly what they intended.

      2 replies →

  • this sounds like a fun research area. do LLMs have plans about future tokens?

    how do we get 100 tokens of completion, and not just one output layer at a time?

    are there papers you've read that you can share that support the hypothesis? vs the hypothesis that the LLM doesn't have ideas about future tokens when it's predicting the next one?

    • Lol... Try building systems off them and you will very quickly learn concretely that they "plan".

      It may not be as evident now as it was with earlier models. A model will fabricate the preconditions needed to output the final answer it "wanted".

      I ran into this when using quasi least-to-most style structured output.