Comment by krackers
2 months ago
Maybe slightly related: canon layers provide direct horizontal information flow along residual streams. See this paper, which claims precisely that LLMs struggle with horizontal information flow, since "looking back a token" is fairly expensive: it can only be done via encoding in the residual stream and attention layers.
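A minimal sketch of the idea, assuming canon layers amount to a causal local mixing (a depthwise-convolution-like weighted sum over the current and a few previous positions) added back into the residual stream; the function name and kernel form here are illustrative, not the paper's exact formulation:

```python
import numpy as np

def canon_layer(hidden, kernel):
    """Causal local mixing over the residual stream (illustrative sketch).

    Each position t receives a weighted sum of its own hidden state and the
    previous len(kernel)-1 states, added residually, so "looking back a
    token" becomes a direct, cheap operation rather than something routed
    through attention.

    hidden: (seq_len, d_model) residual-stream states
    kernel: (K,) mixing weights; kernel[k] weights the state k tokens back
    """
    seq_len, _ = hidden.shape
    mixed = np.zeros_like(hidden)
    for t in range(seq_len):
        for k in range(len(kernel)):
            if t - k >= 0:  # causal: never read future positions
                mixed[t] += kernel[k] * hidden[t - k]
    return hidden + mixed  # residual connection
```

Because the mix is strictly causal, perturbing a later token cannot change any earlier position's output, unlike full attention where information routing must be learned.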