Comment by zozbot234
1 month ago
> far as I can tell, there is nothing about the training process of these models that would encourage them to make the output of any layer apart from (n-1) meaningful as the input of layer n
Wouldn't "pass-through" identity connections have exactly that effect? These are quite common in transformer models.
Yeah, that's what I meant by "initialised as identity and the training process did not get to change them much".
There are explicit residual connections in a transformer block. Look up "residual connections" on Google Images and you will see them in the standard block diagrams.
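To make the point concrete, here is a minimal sketch (not from the thread, written in plain NumPy with hypothetical `attn`/`mlp` sub-layers) of how the residual connections in a transformer block pass the input straight through: if the sub-layers output (near) zero, the block is (near) the identity map.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Standard layer normalization over the last axis."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def transformer_block(x, attn, mlp):
    """Pre-norm transformer block with two residual connections.

    `attn` and `mlp` stand in for the attention and feed-forward
    sub-layers; only the residual structure matters here.
    """
    # Residual connection 1: the input is added back after attention.
    x = x + attn(layer_norm(x))
    # Residual connection 2: same pattern around the feed-forward sub-layer.
    x = x + mlp(layer_norm(x))
    return x

# If both sub-layers output zero, the block is exactly the identity,
# which is why "pass-through" behaviour is easy for training to keep.
x = np.random.default_rng(0).normal(size=(4, 8))
zero = lambda h: np.zeros_like(h)
print(np.allclose(transformer_block(x, zero, zero), x))
```

This is just the residual wiring; a real block would of course have trained attention and MLP weights rather than zero stubs.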