Comment by seydor

2 years ago

Possibly the authors did not have a mental model of why the model worked. Attention, keys, and heads may have been post hoc rationalizations. The alchemy stage may be comical, but it is necessary.

I think this misses important history. This was a machine translation paper, and we were already using seq2seq RNNs with attention at the time. They didn't coin the term "attention"; they just realized that you could use attention from a sequence to itself. Terminology and understanding are always super path-dependent.

  • RNNs worked better at the time when you reversed the target sequence.

    • That's interesting because I remember testing LSTMs for language modeling on some dataset (probably PTB), and finding that they got lower perplexity left-to-right than right-to-left.
