Comment by andy12_

1 day ago

I remember doing this kind of test on a vanilla transformer trained on my laptop on a small text dataset. I basically added N^3 attention, where each layer could pay attention to the outputs of previous layers. It didn't improve anything and was much slower.
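
For concreteness, a minimal sketch of the kind of modification I mean (PyTorch; module and parameter names are just illustrative, and causal masking is omitted for brevity): each block does the usual self-attention over tokens, plus a second attention whose keys/values are the outputs of all earlier blocks concatenated along the sequence axis, which is where the extra factor of depth in the cost comes from.

    import torch
    import torch.nn as nn

    class CrossLayerBlock(nn.Module):
        def __init__(self, d_model=128, n_heads=4):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.layer_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                    nn.Linear(4 * d_model, d_model))
            self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

        def forward(self, x, prev_states):
            # usual token-level self-attention (causal mask omitted in this sketch)
            h = self.norm1(x)
            x = x + self.self_attn(h, h, h, need_weights=False)[0]
            # extra attention over every earlier layer's output: keys/values have
            # shape (batch, n_prev_layers * seq_len, d_model), so the total cost
            # grows with depth on top of the usual seq_len^2 term
            if prev_states:
                kv = torch.cat(prev_states, dim=1)
                h = self.norm2(x)
                x = x + self.layer_attn(h, kv, kv, need_weights=False)[0]
            return x + self.ff(self.norm3(x))

    blocks = nn.ModuleList(CrossLayerBlock() for _ in range(6))
    x, states = torch.randn(2, 32, 128), []   # (batch, seq_len, d_model)
    for block in blocks:
        x = block(x, states)
        states.append(x)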

Hard to say whether something scales or not from a couple dozen million parameters to an actual billion-parameter model, but I have the impression that the nature of the residual stream and its high dimensionality allows any layer to access information from previous layers if the transformer needs it.
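
The way I think about it: with pre-norm residual connections, the input to any block is just the embedding plus the sum of every earlier block's write into the stream, so a later block can already read a linear view of anything an earlier one produced. A toy check of that identity (linear layers standing in for attention/MLP, names illustrative):

    import torch
    import torch.nn as nn

    d_model, depth = 64, 6
    blocks = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(depth))  # stand-ins for attn/MLP
    norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(depth))

    x = torch.randn(1, 16, d_model)        # embedding output
    writes = [x]
    for block, norm in zip(blocks, norms):
        delta = block(norm(x))             # what this block writes into the stream
        writes.append(delta)
        x = x + delta                      # pre-norm residual update

    # the final stream is exactly the sum of all writes, so any later block
    # sees the embedding plus every earlier block's contribution
    assert torch.allclose(x, sum(writes), atol=1e-5)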