Comment by 4bpp

1 month ago

Yeah, that's what I meant with "initialised as identity and the training process did not get to change them much".

There are explicit residual connections in a transformer block. Look up "residual connections" in Google images and you will see.