Comment by seydor

2 years ago

Possibly the authors did not have a mental model of why the model worked. Attention, keys, and heads may have been post hoc rationalizations. The alchemy stage may be comical, but it is necessary.

thatguysaguy 2 years ago

I think this misses important history. This was a machine translation paper, and we were already using seq2seq RNNs with attention at the time. They didn't coin the term "attention"; they just realized that you could use attention from a sequence to itself. Terminology and understanding are always super path-dependent.

theGnuMe 2 years ago
RNNs worked better at the time when you reversed the target sequence.
- thatguysaguy 2 years ago
  
  That's interesting because I remember testing LSTMs for language modeling on some dataset (probably PTB), and finding that they got lower perplexity left-to-right than right-to-left.
  

quickthrower2 2 years ago

Like asking evolution why brains work :-)