Comment by zozbot234
1 month ago
> far as I can tell, there is nothing about the training process of these models that would encourage them to make the output of any layer apart from (n-1) meaningful as the input of layer n
Wouldn't "pass-through" identity connections have exactly that effect? These are quite common in transformer models.
Yeah, that's what I meant by "initialised as identity and the training process did not get to change them much".
There are explicit residual connections in a transformer block. Look up "residual connections" on Google Images and you will see them in the standard block diagrams.
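To make the point concrete, here is a minimal sketch (not from the thread, written in plain NumPy with hypothetical `attn`/`mlp` sub-layers) of how the residual connections in a transformer block pass the input straight through: if the sub-layers output (near) zero, the block is (near) the identity map.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Standard layer normalization over the last axis."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def transformer_block(x, attn, mlp):
    """Pre-norm transformer block with two residual connections.

    `attn` and `mlp` stand in for the attention and feed-forward
    sub-layers; only the residual structure matters here.
    """
    # Residual connection 1: the input is added back after attention.
    x = x + attn(layer_norm(x))
    # Residual connection 2: same pattern around the feed-forward sub-layer.
    x = x + mlp(layer_norm(x))
    return x

# If both sub-layers output zero, the block is exactly the identity,
# which is why "pass-through" behaviour is easy for training to keep.
x = np.random.default_rng(0).normal(size=(4, 8))
zero = lambda h: np.zeros_like(h)
print(np.allclose(transformer_block(x, zero, zero), x))
```

This is just the residual wiring; a real block would of course have trained attention and MLP weights rather than zero stubs.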