
Comment by aDyslecticCrow

1 year ago

From my (admittedly loose) reading, the paper primarily targets parallelization and fast training, not "vanishing gradients." However, by simplifying the recurrent units, the authors managed to improve both!
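
If I understood the simplification right (and I may not have), the trick is that the gates depend only on the current input rather than the previous hidden state, which turns the update into a linear recurrence that can be evaluated with prefix sums/products instead of a strictly sequential loop. A rough NumPy sketch of that idea, not the authors' actual code:

```python
import numpy as np

# Gated recurrence h_t = (1 - z_t) * h_{t-1} + z_t * h_tilde_t,
# where z_t and h_tilde_t depend only on the input x_t (my assumption).
# Because h_{t-1} enters linearly, the whole sequence can be computed
# with cumulative products/sums, which admit parallel scan implementations.

def sequential_scan(z, h_tilde, h0):
    """Naive O(T) sequential evaluation of the recurrence."""
    h, out = h0, []
    for t in range(len(z)):
        h = (1 - z[t]) * h + z[t] * h_tilde[t]
        out.append(h)
    return np.stack(out)

def parallel_scan(z, h_tilde, h0):
    """Same recurrence written as prefix products/sums over t."""
    a = 1 - z                   # decay coefficients a_t
    b = z * h_tilde             # input contributions b_t
    A = np.cumprod(a, axis=0)   # A_t = prod_{k<=t} a_k
    # h_t = A_t * h0 + A_t * sum_{j<=t} b_j / A_j
    return A * h0 + A * np.cumsum(b / A, axis=0)

rng = np.random.default_rng(0)
T, d = 8, 4
z = rng.uniform(0.1, 0.9, size=(T, d))   # input-dependent gates in (0, 1)
h_tilde = rng.normal(size=(T, d))        # candidate states
h0 = np.zeros(d)

assert np.allclose(sequential_scan(z, h_tilde, h0), parallel_scan(z, h_tilde, h0))
```

(The cumsum/cumprod version here is just to show the recurrence is parallelizable; a real implementation would use a numerically safer log-space scan.)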

This is very clever and very interesting. The paper repeatedly calls it a "decade-old architecture," but in practice it's still widely used, thanks to how easily it adapts to different domains. Framing it as a "competitor" to transformers is also not entirely fair, as transformers and RNNs are not mutually exclusive, and there are many methods that merge them.

An improvement to RNNs is an improvement in a lot of other, sometimes surprising, places. A very interesting read.