Comment by thatguysaguy

2 years ago

I think this misses important history. This was a machine translation paper, and we were already using seq2seq RNNs with attention at the time. They didn't coin the term attention; they just realized that you could apply attention from a sequence to itself. Terminology and understanding are always super path-dependent.

RNNs worked better at the time when you reversed the target sequence.

  • That's interesting because I remember testing LSTMs for language modeling on some dataset (probably PTB), and finding that they got lower perplexity left-to-right than right-to-left.

    • There's a subtle difference here between the translation scenario and what you observed. In translation, the reversal only applies to the second sentence, which tends to present information in the same order as the first sentence (for most common language pairs).

      The improvement in perplexity from reversal points to gradient propagation issues. If it's hard for the LSTM to remember information from the first sentence until it becomes useful in the second sentence, reversing the second sentence may help: it puts the end of the second sentence right after the end of the first, so some of the useful info from the first sentence sits "closer" to where it will be needed.

      I suspect that reversing the first sentence and not reversing the second could have a similar effect.
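
To make the "closer" argument concrete, here is a minimal sketch that counts how many time steps separate each first-sentence token from the second-sentence token it lines up with. It assumes a toy setup that is not from the thread: both sentences have the same length n, token i in the first aligns with token i in the second (perfectly monotonic alignment), and the RNN reads the two sentences back to back. The function name `alignment_distances` is made up for illustration.

```python
def alignment_distances(n, reverse_source=False, reverse_target=False):
    """Time-step gap between source token i and its aligned target token i.

    Assumes a toy monotonic alignment (source token i <-> target token i)
    and that the model reads the source followed directly by the target.
    """
    src = list(range(n))
    tgt = list(range(n))
    if reverse_source:
        src.reverse()
    if reverse_target:
        tgt.reverse()

    # Position of each token in the concatenated [source, target] stream.
    src_pos = {tok: t for t, tok in enumerate(src)}
    tgt_pos = {tok: n + t for t, tok in enumerate(tgt)}

    return [tgt_pos[i] - src_pos[i] for i in range(n)]


if __name__ == "__main__":
    n = 4
    print("no reversal:    ", alignment_distances(n))
    print("reverse target: ", alignment_distances(n, reverse_target=True))
    print("reverse source: ", alignment_distances(n, reverse_source=True))
```

Under these assumptions the average gap is the same in all three cases (n steps), but with either reversal some aligned pairs end up only one step apart, while without reversal every pair is n steps apart. Reversing the first sentence gives the mirror image of reversing the second, which fits the guess that it could have a similar effect.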