Comment by trott
1 year ago
> The big-O here is irrelevant for the architectures since it's all in the configuration & implementation of the model; i.e. there is no relevant asymptote to compare.
?!
NNs are like any other algorithm in this regard. Heck, look at the bottom of page 2 of the "Were RNNs All We Needed" paper. It has big-O notation there and elsewhere.
> I was responding to
>> a Transformer layer can look back see O(X^2) numbers, while an RNN can only see O(X) numbers
In the BASED paper, in Eq. 10, sizeof(s) = 2dN. But I defined d = N = X above. Ergo, sizeof(s) = 2X^2 = O(X^2).
For minGRU, sizeof(s) = d. Ergo, sizeof(s) = X = O(X).
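For concreteness, here is that arithmetic as a minimal sketch (assuming d = N = X as above; the function names are just illustrative):

```python
def kv_cache_size(N: int, d: int) -> int:
    """Transformer layer viewed as a recurrence: N keys + N values of width d (BASED Eq. 10)."""
    return 2 * N * d

def mingru_state_size(d: int) -> int:
    """minGRU carries a single hidden vector of width d."""
    return d

X = 1024                      # set d = N = X, as above
print(kv_cache_size(X, X))    # 2 * X**2 -> O(X^2)
print(mingru_state_size(X))   # X        -> O(X)
```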
That's just the state-size calculation, which would be O(N) and O(1) respectively (treating d as a fixed constant). The BASED paper is saying that if you made a Transformer recurrent you would have a state size of 2Nd -> O(N), while BASED has a state size of d*d' -> O(1).
Transformers do have O(N^2) time and memory complexity with respect to sequence length, and BASED/RNNs/SSMs have O(N) time and O(1) memory, if that's what you mean. The point is that this doesn't really give an indication of quality.
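Roughly, the scalings I mean look like this when you hold the model width d fixed and vary only N (a sketch; the cost formulas ignore constant factors and projection costs):

```python
D = 512  # model width, held fixed

def transformer_costs(N: int):
    time = N * N * D     # each of N queries attends to up to N keys -> O(N^2) time
    memory = 2 * N * D   # KV cache per layer -> O(N) memory
    return time, memory

def fixed_state_costs(N: int):
    time = N * D * D     # one constant-cost update per token -> O(N) time
    memory = D * D       # fixed-size state (e.g. BASED's d x d') -> O(1) memory
    return time, memory

for N in (1_000, 10_000):
    print(N, transformer_costs(N), fixed_state_costs(N))
```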
We can choose our constants arbitrarily, so the big-O you've stated only indicates memory/time complexity, not 'look-back' ability relevant to any task. If you feed the entire sequence into an RNN N times, you also get perfect recall at O(N^2) cost, but it's not exactly an efficient use of our resources.
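A rough sketch of what that re-feeding scheme looks like (the recurrence here is a placeholder standing in for any fixed-state RNN, not minGRU itself):

```python
def rnn_step(state, token):
    # placeholder recurrence; any fixed-state update would do
    return state + token

def rerun_per_position(tokens):
    outputs = []
    for t in range(1, len(tokens) + 1):
        state = 0
        for tok in tokens[:t]:   # re-read the whole prefix: O(N) work per output
            state = rnn_step(state, tok)
        outputs.append(state)
    return outputs               # N outputs x O(N) each -> O(N^2) total

print(rerun_per_position([1, 2, 3, 4]))  # [1, 3, 6, 10]
```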
Ideally our state memory is maximally utilized; this is the case for RNNs in the limit (although likely oversubscribed), but it is not the case for Transformers. The holy grail is an input-dependent state size, but that is quite difficult.