Comment by andy12_
1 day ago
I remember running this kind of test on a vanilla transformer trained on my laptop with a small text dataset. I basically added N^3 attention, where each layer could attend to the outputs of previous layers. It didn't improve anything and was much slower.
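For concreteness, here is a minimal PyTorch sketch of one way to read that modification (the comment doesn't give the exact setup, so all class and parameter names here are hypothetical): each block cross-attends to the concatenated outputs of all earlier blocks, on top of ordinary causal self-attention. The extra attention grows with both depth and sequence length, which is roughly the "N^3" cost and slowdown described.

```python
import torch
import torch.nn as nn


class CrossLayerBlock(nn.Module):
    """Transformer block that also attends to the outputs of previous blocks."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.layer_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, prev_outputs, attn_mask):
        # Ordinary causal self-attention over the token dimension.
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, attn_mask=attn_mask, need_weights=False)[0]

        # Cross-layer attention: concatenate earlier layers' token sequences along
        # the sequence axis and attend to them; tiling the causal mask per layer
        # keeps the model autoregressive.
        if prev_outputs:
            kv = torch.cat(prev_outputs, dim=1)                   # (B, L_prev*N, D)
            cross_mask = attn_mask.repeat(1, len(prev_outputs))   # (N, L_prev*N)
            q = self.norm2(x)
            x = x + self.layer_attn(q, kv, kv, attn_mask=cross_mask,
                                    need_weights=False)[0]

        return x + self.mlp(self.norm3(x))


class CrossLayerTransformer(nn.Module):
    def __init__(self, vocab_size=1000, d_model=128, n_heads=4, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.blocks = nn.ModuleList(
            [CrossLayerBlock(d_model, n_heads) for _ in range(n_layers)]
        )
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        n = tokens.size(1)
        # Boolean causal mask: True marks positions a query may NOT attend to.
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        x = self.embed(tokens)
        prev_outputs = []  # running stack of earlier layers' outputs
        for block in self.blocks:
            x = block(x, prev_outputs, attn_mask=mask)
            prev_outputs.append(x)
        return self.head(x)


if __name__ == "__main__":
    model = CrossLayerTransformer()
    logits = model(torch.randint(0, 1000, (2, 16)))
    print(logits.shape)  # torch.Size([2, 16, 1000])
```

Self-attention costs O(L·N²) over L layers; the cross-layer attention at layer l adds another O(N²·l), so the total extra work is O(L²·N²), which matches the reported slowdown.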
Hard to say whether something scales from a couple dozen million parameters to an actual billion-parameter model, but my impression is that the high dimensionality of the residual stream already lets any layer access information from previous layers if the transformer needs it.