Comment by sosodev

3 months ago

> LLMs don't "read" text sequentially, right?

The causal masking means future tokens don’t affect earlier tokens’ embeddings as they evolve through the model, but all tokens are processed in parallel… so, yes and no. See this previous HN post (https://news.ycombinator.com/item?id=45644328) about how bidirectional encoders are similar to diffusion’s non-linear way of generating text. Vision transformers use bidirectional encoding because of the non-causal nature of image pixels.
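
To make the "yes and no" concrete, here's a toy NumPy sketch of the difference (single-head attention with no learned projections; the names and shapes are my own for illustration, not from any library):

    import numpy as np

    def self_attention(x, causal=True):
        # Toy single-head self-attention over a (seq_len, dim) array.
        # Every position is computed in parallel; the causal flag only
        # controls which positions may attend to which.
        seq_len, dim = x.shape
        # For brevity, reuse x as queries, keys, and values.
        scores = x @ x.T / np.sqrt(dim)  # (seq_len, seq_len)
        if causal:
            # Lower-triangular mask: position i attends only to j <= i,
            # so future tokens never touch earlier tokens' embeddings.
            mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
            scores = np.where(mask, scores, -np.inf)
        # A bidirectional encoder (BERT, ViT) just skips the mask,
        # letting every position attend to every other.
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ x  # (seq_len, dim)

    # One matmul produces all output rows at once ("parallel"), even
    # though the causal mask makes the information flow one-way.
    tokens = np.random.randn(5, 8)
    out_causal = self_attention(tokens, causal=True)
    out_bidir = self_attention(tokens, causal=False)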