Comment by davesque

2 years ago

The hype really is drowning out the simple fact that basically no one knows what these models are actually doing. Why does it matter so much that we use auto-correlation of embedding vectors as the "attention" mechanism in these models? That we repeat this enough times across all the layers? That we blindly smoosh values together with addition and call it a "skip" connection? Yes, you can tell me a bunch of stuff about gradient flow and residual information, but tell me why any of this is or isn't a good model of causality.
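For concreteness, here is a minimal sketch of the two mechanisms the comment is questioning: self-attention scores computed as a (scaled) correlation of a sequence of embeddings with itself, and the output merged back via plain addition. This is an illustrative toy in NumPy, not anyone's production implementation; the weight shapes and random inputs are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_block(X, Wq, Wk, Wv):
    """One toy self-attention layer with a residual ("skip") connection.

    X: (seq_len, d) token embeddings.
    The score matrix is the sequence correlated with itself -- the
    "auto-correlation" the comment refers to.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)      # pairwise similarity of positions
    attn = softmax(scores, axis=-1)    # each row is a probability dist.
    out = attn @ V                     # weighted mix of value vectors
    return X + out                     # the additive "skip" connection

# Hypothetical example inputs: 5 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(5, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
Y = self_attention_block(X, Wq, Wk, Wv)
```

The residual `X + out` is exactly the "smoosh values together with addition" step: the block's output has the same shape as its input, so layers can be stacked arbitrarily deep.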