Comment by visarga

1 month ago

> there is nothing about the training process of these models that would encourage them to make the output of any layer apart from (n-1) meaningful as the input of layer n

There is something that does exactly that - the residual connections. Each layer adds a delta to it, but that means they share a common space. There are papers showing the correlation across layers, of course it is not uniform across depth, but consecutive layers tend to be correlated.