Comment by yorwba

13 days ago

Attention operates entirely on hidden memory, in the sense that it usually isn't exposed to the end user. An attention head on one thinking token can attend to one thing, and the same head on the next thinking token can attend to something entirely different; the next layer can then combine the two values, maybe on the second thinking token, maybe much later. So even nonsense filler tokens can create space for intermediate computation to happen.
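
A minimal sketch of the mechanism, assuming a toy single-head causal attention in PyTorch (all names, dimensions, and weights here are illustrative, not taken from any particular model): two filler tokens with identical embeddings still sit at different positions, so the same head produces different attention patterns at each, giving the next layer two distinct intermediate values to combine.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 8

# 2 content tokens followed by 2 identical "filler" tokens.
x = torch.randn(4, d)
x[2] = x[3] = torch.randn(d)   # same token embedding, repeated

# Positional encodings break the symmetry between the two fillers.
x = x + 0.1 * torch.randn(4, d)

# Random projection weights for a single attention head (toy values).
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = q @ k.T / d ** 0.5

# Causal mask: each position may only attend to itself and earlier tokens.
mask = torch.triu(torch.ones(4, 4), diagonal=1).bool()
scores = scores.masked_fill(mask, float("-inf"))

w = F.softmax(scores, dim=-1)
out = w @ v

# The two filler positions have identical token embeddings, yet their
# attention distributions differ, so out[2] and out[3] carry different
# intermediate results that a later layer can read and combine.
print(w[2])
print(w[3])
```

Running this shows distinct attention rows for positions 2 and 3 despite the identical surface tokens, which is the sense in which filler creates extra slots for computation.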