Comment by energy123
1 day ago
It's a little more inductive bias. That's not necessarily a step backwards. You need the right amount of inductive bias for a given data size and model capacity, no more and no less. Transformers already build in an inductive bias of temporal locality by being causal.
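To make the causal part concrete, here is a minimal sketch in plain Python of what the causal mask does to attention weights: each position can only attend to itself and earlier positions. The function name `causal_attention_weights` and the toy scores are assumptions for illustration, not taken from any particular library or model.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over a square score matrix, with query i
    restricted to key positions j <= i (hypothetical helper)."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Causal mask: only keys at positions 0..i are visible to query i.
        visible = scores[i][: i + 1]
        m = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Future positions get exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights

# Uniform toy scores: each query spreads its weight evenly over
# itself and its past, and puts nothing on the future.
w = causal_attention_weights([[0.0] * 3 for _ in range(3)])
```

The constraint is hard-coded rather than learned, which is the sense in which it counts as an inductive bias: no amount of data can make position 0 look at position 2.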