Comment by immibis

1 year ago

No. An RNN has an arbitrarily long path from old inputs to new outputs, even if in practice it can't exploit that path. Transformers have fixed-size input windows.
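
A rough sketch of the path I mean, assuming a plain tanh RNN cell (the names W_h, W_x, b and the sizes below are purely illustrative):

    import numpy as np

    def rnn_forward(tokens, W_h, W_x, b):
        # Fixed-size state, regardless of how many tokens arrive.
        h = np.zeros(W_h.shape[0])
        for x in tokens:  # one update per token, for any sequence length
            h = np.tanh(W_h @ h + W_x @ x + b)
        # The final state depends on the very first token through a chain of
        # T updates: that chain is the arbitrarily long path.
        return h

    T, d_state, d_in = 1000, 8, 4
    rng = np.random.default_rng(0)
    h = rnn_forward(rng.normal(size=(T, d_in)),
                    0.1 * rng.normal(size=(d_state, d_state)),
                    0.1 * rng.normal(size=(d_state, d_in)),
                    np.zeros(d_state))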

A chunk of the output still goes into the transformer input, so the arbitrarily long path still exists; it just goes through a decoding/encoding step.

No, you can give as much context to a transformer as you want; you just run out of memory.
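
A back-of-the-envelope sketch of that memory point, with made-up sizes (d_model and n_layers are illustrative, not any particular model): the RNN's state stays constant while a decoder-only transformer's KV cache grows with the context length.

    d_model, n_layers = 4096, 32  # illustrative sizes only

    def rnn_state_floats(T):
        # the recurrent state doesn't depend on how many tokens came before
        return d_model

    def kv_cache_floats(T):
        # keys and values kept for every past token, in every layer
        return 2 * n_layers * T * d_model

    for T in (1_000, 100_000, 10_000_000):
        print(f"T={T:>10,}  RNN state={rnn_state_floats(T):,}  KV cache={kv_cache_floats(T):,}")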

  • An RNN doesn't run out of memory from that, so the two are still fundamentally different.

    How do you encode arbitrarily long positions, anyway?

    • They are different, but transformers don't have fixed windows; you can extend the context or shrink it. I think you can extend a positional encoding if it isn't a learned one.
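
      A sketch of the non-learned case, using the standard sinusoidal encoding (d_model here is just an illustrative size): the formula is defined for any position index, so nothing in it caps the context length, though whether the model copes well with positions it never saw in training is a separate question.

          import math

          def sinusoidal_position(pos, d_model=8):
              # standard sin/cos encoding: well defined for any integer position
              enc = []
              for i in range(0, d_model, 2):
                  angle = pos / (10000 ** (i / d_model))
                  enc.extend([math.sin(angle), math.cos(angle)])
              return enc

          print(sinusoidal_position(10))          # inside a typical training window
          print(sinusoidal_position(10_000_000))  # far beyond it, still computable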

You can't have a fixed state and an arbitrarily long path from the input. Well, you can, but then it's meaningless, because you fundamentally cannot keep stuffing an arbitrary amount of information into a fixed-size state. RNNs effectively have fixed-size input windows.

  • The path is arbitrarily long, not wide. It is possible to build an RNN that remembers the first word of the input, no matter how long the input is. This is not possible with a transformer, so we know they are fundamentally different.

    • But an RNN isn't going to remember the first token of the input. It won't know until it sees the last token whether that first token was relevant after all, so it has to learn token-specific update rules that let it guess how long to hold which kinds of information. (In multi-layer systems, the network works with ineffable abstractions rather than tokens, but the same idea applies.)

      What the RNN must be doing reminds me of "sliding window attention" --- the model learns how to partition its state between short- and long-range memories to minimize overall loss. The two approaches seem related, perhaps even equivalent up to implementation details.
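
      A sketch of what sliding window attention means here, with an arbitrary window size w: each token attends only to the previous w positions, so per-token cost stays fixed, which is what invites the comparison with an RNN's fixed-size state.

          import numpy as np

          def sliding_window_mask(T, w):
              # mask[i, j] is True where token i may attend to token j:
              # causal, and at most w positions back
              mask = np.zeros((T, T), dtype=bool)
              for i in range(T):
                  mask[i, max(0, i - w + 1): i + 1] = True
              return mask

          print(sliding_window_mask(6, 3).astype(int))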
