Comment by gugagore
6 months ago
I'm not sure what you mean by "hidden state". If you set aside chain of thought, memories, system prompts, etc. and the interfaces that don't show them, there is no hidden state.
These LLMs are almost always, to my knowledge, autoregressive models, not recurrent models (Mamba is a notable exception).
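Roughly the difference, as a toy sketch (rnn_step and transformer below are stand-ins, not real models):

    # Recurrent: a state vector is carried forward between steps.
    def recurrent_generate(rnn_step, state, first_token, n_tokens):
        tokens = [first_token]
        for _ in range(n_tokens):
            state, next_token = rnn_step(state, tokens[-1])  # state persists across steps
            tokens.append(next_token)
        return tokens

    # Autoregressive: everything is recomputed from the visible tokens each step.
    def autoregressive_generate(transformer, prompt_tokens, n_tokens):
        tokens = list(prompt_tokens)
        for _ in range(n_tokens):
            next_token = transformer(tokens)  # no carried-over state between steps
            tokens.append(next_token)
        return tokens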
If you don't know, that's not necessarily anyone's fault, but why are you dunking on the conversation? The hidden state is a foundational part of a transformer's implementation. And since we're not allowed to use metaphors because that's too anthropomorphic, you're just going to have to go learn the math.
The comment you are replying to is not claiming ignorance of how models work. It is saying that the author does know how they work, and they do not contain anything that can properly be described as "hidden state". The claimed confusion is over how the term "hidden state" is being used, on the basis that it is not being used correctly.
I don't think your response is very productive, and I find that my understanding of LLMs aligns with the person you're calling out. We could both be wrong, but I'm grateful that someone else spoke up to say it doesn't seem to match their mental model; we would all love to learn a more correct way of thinking about LLMs.
Telling us to just go and learn the math is a little hurtful and doesn't really get me any closer to learning it. It comes across as gatekeeping.
Do you appreciate the difference between an autoregressive model and a recurrent model?
The "transformer" part isn't under question. It's the "hidden state" part.
Hidden state in the form of the attention-head activations, intermediate layer activations, and so on. Logically, in autoregression these are recalculated every time you run the sequence to predict the next token. The point is, the entire NN state isn't output for each token. There is a lot of hidden state that goes into selecting that token, and the token isn't a full representation of that information.
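A minimal sketch of that loop (forward here is a random stand-in for a real transformer's forward pass):

    import numpy as np

    rng = np.random.default_rng(0)

    def forward(token_ids):
        # Stand-in for a transformer forward pass: per-position activations
        # plus next-token logits. A real model computes these with attention
        # and MLP layers; random numbers here just show the data flow.
        activations = rng.normal(size=(len(token_ids), 16))
        logits = rng.normal(size=100)  # scores over a toy 100-token vocab
        return activations, logits

    tokens = [3, 14, 15]
    for _ in range(5):
        activations, logits = forward(tokens)   # recomputed from scratch each step
        next_token = int(np.argmax(logits))     # only this is kept
        tokens.append(next_token)               # `activations` are discarded, never output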
State typically means something carried between interactions. By this definition a simple for loop has "hidden state" in its counter.
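E.g., in that reading:

    total = 0
    for i in range(10):       # `i` is "state" that exists only inside the loop
        total += i
    print(total)              # the output never exposes the intermediate values of `i`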
Hidden layer is a term of art in machine learning / neural network research. See https://en.wikipedia.org/wiki/Hidden_layer . Somehow this term mutated into "hidden state", which in informal contexts does seem to be used quite often the way the grandparent comment used it.
That's not what "state" means, typically. The "state of mind" you're in affects the words you say in response to something.
Intermediate activations aren't "state". The tokens that have already been generated, along with the fixed weights, are the only data that affect the next tokens.
Sure it's state. It logically evolves stepwise per token generation. It encapsulates the LLM's understanding of the text so far so it can predict the next token. That it is merely a fixed function of other data isn't interesting or useful to say.
All deterministic programs are fixed functions of program code, inputs and computation steps, but we don't say that they don't have state. It's not a useful distinction for communicating among humans.
Plus a randomness seed.
The 'hidden state' being referred to here is essentially the "what might have been" had the dice rolls gone differently (e.g., been seeded differently).
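E.g., the same next-token distribution yields different continuations under different seeds (toy numbers, not from any real model):

    import numpy as np

    vocab = ["the", "a", "this"]
    probs = [0.6, 0.3, 0.1]                       # a fixed next-token distribution

    for seed in (0, 1, 2):
        rng = np.random.default_rng(seed)
        print(seed, rng.choice(vocab, p=probs))   # same distribution, different dice rolls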
Do LLMs consider future tokens when making next-token predictions?
E.g., pick 'the' as the next token because there's a strong probability of 'planet' as the token after?
Is it only past state that influences the choice of 'the'? Or is the model predicting many tokens in advance and only returning the one in the output?
If it does predict many, I'd consider that state hidden in the model weights.
I think recent Anthropic work showed that they "plan" future tokens in advance in an emergent way:
https://www.anthropic.com/research/tracing-thoughts-language...
oo thanks!
The most obvious case of this is `an apple` vs `a pear`. LLMs never get the a/an distinction wrong, because their internal state 'knows' the word that'll come next.
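One rough way to peek at that preference is to compare the next-token probabilities directly; a sketch using Hugging Face transformers, with gpt2 and the prompt chosen arbitrarily as an example:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    ids = tok("For a snack she ate", return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]        # scores for the very next token
    probs = logits.softmax(-1)

    for word in [" a", " an"]:
        tid = tok(word).input_ids[0]             # each is a single GPT-2 token
        print(repr(word), float(probs[tid]))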
If I give an LLM a fragment of text that starts with, "The fruit they ate was an <TOKEN>", regardless of any plan, the grammatically correct answer is going to force a noun starting with a vowel. How do you disentangle the grammar from planning?
There's going to be a lot more "an apple" in the corpus than "an pear".