
Comment by HarHarVeryFunny

1 year ago

I'd guess it's because the Transformer architecture is (I assume) fairly close to the way our brain learns and produces language - a similar hierarchical approach and perhaps a similar kind of inter-embedding, attention-based copying?
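(For anyone unfamiliar with the "copying" framing: here's a minimal NumPy sketch of scaled dot-product self-attention, where each output vector is a similarity-weighted average - effectively a soft copy - of the other tokens' value vectors. The names and toy data are illustrative, not from any particular library.)

```python
# Minimal sketch of "attention-based copying": each output is a
# similarity-weighted average (a soft copy) of the value vectors.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d) arrays. Returns (seq_len, d) outputs."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # pairwise query/key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over source tokens
    return weights @ V                               # weighted "copy" of value vectors

# Toy example: self-attention over 4 token embeddings of dimension 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(X, X, X)
print(out.shape)  # (4, 8)
```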

Similar to how CNNs are so successful at image recognition because they also roughly follow the way we do it.

Other seq2seq language approaches work too, but not as well as Transformers, which I'd guess is because Transformers better match our own inductive biases, maybe due to the specific form of attention.