← Back to context

Comment by NiloCK

6 months ago

Plus a randomness seed.

The 'hidden state' being referred to here is essentially the "what might have been" had the dice rolls gone differently (eg, been seeded differently).

No, that's not quite what I mean. I used the logits in another reply to point out that there is data specific to the generation process that is not available from the tokens, but there's also the network activations adding up to that state.

Processing tokens is a bit like ticks in a CPU, where the model weights are the program code, and tokens are both input and output. The computation that occurs logically retains concepts and plans over multiple token generation steps.

That it is fully deterministic is no more interesting than saying a variable in a single threaded program is not state because you can recompute its value by replaying the program with the same inputs. It seems to me that this uninteresting distinction is the GP's issue.