Comment by slashdave
1 year ago
> the amount of information the model retains about it is bounded by whatever is in its hidden state
This is no different than a transformer, which, after all, is bound by a finite state, just organized in a different manner.
> This is no different than a transformer, which, after all, is bound by a finite state, just organized in a different manner.
It's not just a matter of organizing things differently. Suppose your network dimension and sequence length are both X.
Then your memory usage (per layer) will be O(X^2), while your training update cost will be O(X^3). That's for both Transformers and RNNs.
However, at the end of the sequence, a Transformer layer can look back and see O(X^2) numbers, while an RNN can only see O(X) numbers.
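To make that concrete, here is a rough back-of-the-envelope sketch (toy Python, ignoring constant factors, heads, and batch size; X plays both the model-dimension and sequence-length role as above):

    # Back-of-the-envelope scaling, ignoring constants, heads, and batch size.
    # X is both the model dimension and the sequence length, as above.
    def scaling(X):
        return {
            "memory_per_layer": X * X,           # O(X^2) activations, both architectures
            "update_cost_per_layer": X * X * X,  # O(X^3) training cost, both architectures
            "transformer_lookback": X * X,       # last position sees the whole cache: O(X^2) numbers
            "rnn_lookback": X,                   # last position sees only the hidden vector: O(X) numbers
        }

    for X in (1024, 4096):
        print(X, scaling(X))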
Transformers actually have a quantifiable state size (see https://hazyresearch.stanford.edu/static/posts/2024-06-22-ac...) although it's anywhere between 200k and 2M floats (for the 360M and 1.33B models respectively, if I'm not mistaken). So a sufficiently sized RNN could have the same state capacity as a transformer.
(this is from the Based paper: https://arxiv.org/pdf/2402.18668)
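For a sense of the bookkeeping involved (the hyperparameters below are made up for illustration and won't reproduce the paper's exact figures), the transformer's "state" at the end of a sequence is its per-layer KV cache, which you can count the same way you'd count an RNN's hidden-state floats:

    # Count the floats a layer can carry forward at the end of a sequence.
    # Configs are hypothetical, chosen only to illustrate the comparison.
    def transformer_state_floats(seq_len, d_model):
        # one key vector and one value vector of size d_model per position
        return 2 * seq_len * d_model

    def rnn_state_floats(d_hidden):
        return d_hidden

    print(transformer_state_floats(seq_len=2048, d_model=1024))  # grows with context length
    print(rnn_state_floats(d_hidden=4096))                       # fixed, regardless of context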
> Transformers actually have a quantifiable state size
Are you griping about my writing O(X^2) above instead of precisely 2X^2, like this paper? The latter implies the former.
> So a sufficiently sized RNN could have the same state capacity as a transformer.
Does this contradict anything I've said? If you increase the size of the RNN, while keeping the Transformer fixed, you can match their recurrent state sizes (if you don't run out of RAM or funding).
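As a sketch of that trade (again with made-up numbers): to give an RNN the same per-layer state capacity as a transformer's KV cache at some context length, you solve for the hidden dimension, and it grows linearly with that context length.

    # Hidden dimension an RNN would need to match a transformer layer's
    # KV-cache float count at a given context length (hypothetical numbers).
    def matching_rnn_hidden_dim(seq_len, d_model):
        return 2 * seq_len * d_model  # same float count as the per-layer KV cache

    for seq_len in (1024, 4096, 16384):
        print(seq_len, matching_rnn_hidden_dim(seq_len, d_model=1024))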
Simplistic thinking. An RNN hidden state space of high dimension provides plenty of room for linear projections of token histories. I think people just do not realize how huge R^N can be.
> Simplistic thinking. An RNN hidden state space of high dimension provides plenty of room for linear projections of token histories. I think people just do not realize how huge R^N can be.
16N bits is the hard limit (16 bits per number for an N-dimensional half-precision state), but more realistically it's about 2N bits or less of useful information.
You'd need to grow the network dimension in proportion to the maximum sequence length just to avoid hitting that information-theoretic limit.
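A tiny illustration of that limit (a hypothetical accounting, assuming 16-bit numbers and my rough "2 useful bits per number" figure; the bits-per-token target is made up): if the state has to retain even a few bits about every token of a length-T history, the required N grows linearly with T.

    # Crude information-theoretic accounting for an N-dimensional hidden state,
    # assuming ~2 useful bits per stored number (hard ceiling would be 16).
    def min_hidden_dim(seq_len, bits_per_token=2, useful_bits_per_number=2):
        needed_bits = seq_len * bits_per_token  # bits to keep about the whole history
        return needed_bits // useful_bits_per_number

    for T in (1024, 8192, 65536):
        print(T, min_hidden_dim(T))  # required N grows linearly with sequence length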