Comment by jstanley

1 month ago

> As far as I can tell, there is nothing about the training process of these models that would encourage them to make the output of any layer apart from (n-1) meaningful as the input of layer n

Right, I had the same thought.

Even if the output were in the same "format", does the LLM even have any way to know which order the outputs will arrive in? The ordering of the nodes is part of our representation of the network; it's not fundamental to it.
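To make that concrete, here's a tiny toy example (a made-up two-layer linear net, nothing to do with any real LLM): the next layer's weights are tied to positions, so handing it the same activations in a different order changes the result, and relabelling is only harmless if the weights are relabelled consistently.

```python
import numpy as np

# Toy two-layer linear network: y = W2 @ (W1 @ x).
W1 = np.array([[1., 0.],
               [0., 1.],
               [1., 1.]])        # maps 2 -> 3
W2 = np.array([[1., 2., 3.]])    # maps 3 -> 1
x = np.array([1., 2.])

h = W1 @ x                       # hidden activations: [1, 2, 3]
y = W2 @ h                       # 1*1 + 2*2 + 3*3 = [14]

# Feed the hidden units to layer 2 in a different order: the
# weights in W2 are tied to positions, so the output changes.
perm = np.array([2, 0, 1])
y_shuffled = W2 @ h[perm]        # 1*3 + 2*1 + 3*2 = [11]

# Reordering is only harmless if W2's columns are permuted the
# same way, which recovers the identical function.
y_relabelled = W2[:, perm] @ h[perm]   # [14] again

print(y, y_shuffled, y_relabelled)
```

So "which node is which" only exists in the weights of the layer consuming the activations; the activations themselves carry no labels.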

It would be like shuffling the bytes in a PNG file and expecting a program to still read it as a PNG file.

The more I think about this, the less sense it makes to me.