Comment by famouswaffles

1 year ago

You can't have a fixed state and an arbitrarily long path from the input. Well, you can, but then it's meaningless, because you fundamentally cannot keep stuffing information of arbitrary length into a fixed-size state. RNNs effectively have fixed-size input windows.

The path is arbitrarily long, not wide. It is possible to build an RNN that remembers the first word of the input, no matter how long the input is. This is not possible with a transformer, so we know they are fundamentally different.
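
A toy sketch of that point (hand-written Python, not a trained model): the state stays fixed-size while the path from the first token to the output grows with the sequence length. The `latch` slot and the `hash`-based summary are illustrative assumptions about what a learned RNN could in principle represent, not how any real trained RNN works.

```python
# Toy illustration: an RNN-style recurrence with a fixed-size state that
# nonetheless preserves the first token of an arbitrarily long input.

def rnn_step(state, token):
    latch, summary = state
    if latch is None:                 # first step: store the first token forever
        latch = token
    summary = (summary + hash(token)) % 1_000_003   # lossy, fixed-size summary of the rest
    return (latch, summary)

def run(tokens):
    state = (None, 0)                 # fixed-size state, regardless of input length
    for t in tokens:                  # path from first token to output grows with len(tokens)
        state = rnn_step(state, t)
    return state

first_token, _ = run(["the", "quick", "brown", "fox"] * 1000)
print(first_token)                    # "the", no matter how long the input was
```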

  • But an RNN isn't going to remember the first token of the input. It won't know until it sees the last token whether that first token was relevant after all, so it has to learn token-specific update rules that let it guess how long to hold which kinds of information. (In multi-layer systems, the network uses ineffable abstractions rather than tokens, but the same idea applies.)

    What the RNN must be doing reminds me of "sliding window attention" --- the model learns how to partition its state between short- and long-range memories to minimize overall loss. The two approaches seem related, perhaps even equivalent up to implementation details.

    • The most popular RNNs (the ones that were successful enough for Google Translate and the like) actually had this behavior baked into the architecture: they were called LSTMs, "Long Short-Term Memory" networks.
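
      A minimal numpy sketch of a standard LSTM cell, to show the gating that is "baked in" and that implements the learned partition between short- and long-range memory described above: the forget gate decides how much of the long-term cell state to keep at each step, the input gate how much new information to write. The dimensions and random weights here are placeholder assumptions; in a real model they are learned.

      ```python
      import numpy as np

      def sigmoid(x):
          return 1.0 / (1.0 + np.exp(-x))

      def lstm_step(x, h, c, W, b):
          """One LSTM step. x: input, h: short-term state, c: long-term cell state."""
          z = W @ np.concatenate([x, h]) + b      # all four gate pre-activations at once
          i, f, o, g = np.split(z, 4)
          i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
          c = f * c + i * np.tanh(g)              # forget old memory, write new memory
          h = o * np.tanh(c)                      # expose part of memory as the output
          return h, c

      d_in, d_h = 8, 16                           # arbitrary toy sizes
      rng = np.random.default_rng(0)
      W = rng.normal(scale=0.1, size=(4 * d_h, d_in + d_h))
      b = np.zeros(4 * d_h)

      h, c = np.zeros(d_h), np.zeros(d_h)
      for x in rng.normal(size=(100, d_in)):      # arbitrarily long input, fixed-size (h, c)
          h, c = lstm_step(x, h, c, W, b)
      print(h.shape, c.shape)                     # (16,) (16,)
      ```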