Comment by BarakWidawsky
6 hours ago
You’re mostly right, but you’re conflating attention with autoregressive/causal generation, which is the real issue that prevents you from using more compute.
You can use diffusion with attention, and this model does in fact use attention.
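To make the distinction concrete, here is a rough sketch (not this model's actual code) of the same attention function run with and without a causal mask. The function name `attention`, the `causal` flag, and the toy inputs are all made up for illustration; the point is only that causality is a masking choice layered on top of attention, not part of attention itself.

```python
import math

def attention(q, k, v, causal=False):
    """Scaled dot-product attention over lists of vectors (hypothetical helper).

    causal=True  -> position i only sees positions 0..i (autoregressive style).
    causal=False -> every position sees every other position (unmasked).
    """
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        # Positions this query is allowed to attend to.
        visible = list(range(i + 1)) if causal else list(range(n))
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
                  for j in visible]
        # Numerically stable softmax over the visible positions.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[j][c] for w, j in zip(weights, visible))
                    for c in range(len(v[0]))])
    return out

# Toy input: three positions, two features each (illustrative values only).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
causal_out = attention(x, x, x, causal=True)
full_out = attention(x, x, x, causal=False)
```

With the mask on, the first position can only see itself, so its output is just its own value vector; with the mask off it mixes in the later positions too. The last position sees everything either way, so its output is the same in both runs.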
Yes, I should have said autoregressive.