
Comment by scotty79

1 year ago

The only strength of transformers is that they run once for each token and can pass intermediate state to themselves as they solve your problem. They have to conceal that state in tokens that look to humans like part of the response.
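
To make that concrete, here's a minimal sketch (plain Python, with a dummy stand-in for the model's forward pass; `next_token` and `generate` are hypothetical names) of why the emitted tokens are the only state a decoder-only transformer carries between steps: each step re-reads the whole sequence, and the activations from the previous step are thrown away.

    def next_token(tokens):
        # dummy stand-in: a real model would run a transformer over `tokens`
        return (sum(tokens) * 31 + 7) % 1000

    def generate(prompt_tokens, n_steps):
        tokens = list(prompt_tokens)
        for _ in range(n_steps):
            # everything the model "thought" during this call is discarded;
            # only the new token survives into the next iteration
            tokens.append(next_token(tokens))
        return tokens

    print(generate([1, 2, 3], 5))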

It's obvious why the newest toy from OpenAI can solve problems better: mostly by just being allowed to "talk to itself" for a moment before starting the answer that the human sees.

Given that, a modern incarnation of the RNN could be vastly cheaper than transformers, provided it can be trained.

Convolutional neural networks get more visual understanding by "reusing" their capacity across the area of the image. RNNs and transformers can get a better understanding of a given problem by "reusing" their capacity to learn and infer across time (across the steps of an iterative process, really).
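
Roughly what that "reuse" looks like, as a sketch in PyTorch (my choice of framework; the sizes are arbitrary): a conv layer applies the same small kernel at every spatial position, while an RNN cell applies the same weights at every time step.

    import torch
    import torch.nn as nn

    conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)   # one 3x3 kernel set...
    image = torch.randn(1, 3, 64, 64)
    feat = conv(image)                                  # ...reused at all 64x64 positions

    cell = nn.RNNCell(input_size=16, hidden_size=32)    # one weight matrix...
    h = torch.zeros(1, 32)
    for t in range(10):                                 # ...reused at every step
        x_t = torch.randn(1, 16)
        h = cell(x_t, h)                                # h carries state across steps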

When it comes to the transformer architecture, the attention is a red herring. It's just a more or less arbitrary way to partition the network so it can be parallelized. The only bit of potential magic is the "shortcut" links between non-adjacent layers that help propagate learning back through many layers.
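
A minimal sketch of the kind of shortcut link meant here, in the spirit of a residual connection (again PyTorch, arbitrary shapes): the input skips past a block of layers and is added back to its output, giving gradients a short path backwards.

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.inner = nn.Sequential(
                nn.Linear(dim, dim),
                nn.ReLU(),
                nn.Linear(dim, dim),
            )

        def forward(self, x):
            # the identity path bypasses the two inner layers entirely
            return x + self.inner(x)

    x = torch.randn(4, 64)
    y = ResidualBlock(64)(x)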

Basically, the optimal network is a deep, dense one (every neuron connects to every neuron in all preceding layers) that is run in some form of recurrence.
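
As a toy sketch of that ideal (my own construction, not a named architecture): each layer gets the concatenated outputs of all preceding layers, and the whole stack is fed its own output recurrently. Note how the connection count grows quadratically with depth.

    import torch
    import torch.nn as nn

    class DenseStack(nn.Module):
        def __init__(self, dim, depth):
            super().__init__()
            # layer i sees the original input plus the outputs of layers 0..i-1
            self.layers = nn.ModuleList(
                [nn.Linear(dim * (i + 1), dim) for i in range(depth)]
            )

        def forward(self, x):
            outputs = [x]
            for layer in self.layers:
                outputs.append(torch.relu(layer(torch.cat(outputs, dim=-1))))
            return outputs[-1]

    net = DenseStack(dim=32, depth=4)
    state = torch.randn(1, 32)
    for _ in range(8):
        state = net(state)   # recurrence: feed the output straight back in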

But we don't have enough compute to train that. So we need to arbitrarily sever some connections so the whole thing is easier to parallelize. It really doesn't matter which ones we cut, unless we do it in some obviously stupid way.

The actually inventive, magic part of LLMs possibly happens in the token and positional encoders.
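
For concreteness, one example of such an encoder is the sinusoidal positional encoding from the original transformer paper; a short sketch in PyTorch:

    import torch

    def positional_encoding(n_positions, d_model):
        pos = torch.arange(n_positions, dtype=torch.float32).unsqueeze(1)  # (n_positions, 1)
        i = torch.arange(d_model // 2, dtype=torch.float32).unsqueeze(0)   # (1, d_model/2)
        angles = pos / torch.pow(torch.tensor(10000.0), 2 * i / d_model)
        pe = torch.zeros(n_positions, d_model)
        pe[:, 0::2] = torch.sin(angles)   # even dimensions
        pe[:, 1::2] = torch.cos(angles)   # odd dimensions
        return pe

    print(positional_encoding(4, 8))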