Comment by yorwba
2 months ago
I don't see how you could fit causal masking into this framework without having to do n different FFTs, and there's no mention of positional embeddings either, so I guess the self-attention implementation being compared against is noncausal NoPE, which would make this a case of baseline sandbagging and maybe not so impressive.
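To make the causal-masking point concrete, here is a minimal sketch (my own illustration, not the paper's code) assuming an FNet-style mix that takes the real part of an FFT along the sequence dimension: the global transform leaks information from future tokens, and the naive fix is one FFT per prefix, i.e. n FFTs.

```python
import numpy as np

def fft_mix(x):
    # Global mixing over the whole sequence (noncausal by construction).
    return np.fft.fft(x, axis=0).real

def causal_fft_mix(x):
    # Naive causal variant: position t may only see x[:t+1],
    # which forces a separate FFT for every prefix -> n FFTs total.
    n = x.shape[0]
    return np.stack([np.fft.fft(x[: t + 1], axis=0).real[-1] for t in range(n)])

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))  # (sequence length, model dim)

# Perturb only the last token; with global FFT mixing even the first
# output position changes, i.e. information flows from the future.
x_future = x.copy()
x_future[-1] += 1.0
print(np.allclose(fft_mix(x)[0], fft_mix(x_future)[0]))                # False -> noncausal
print(np.allclose(causal_fft_mix(x)[0], causal_fft_mix(x_future)[0]))  # True -> causal, at the cost of n FFTs
```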
If the results were close to state-of-the-art, probably the author would've mentioned it?
They do show their model winning every category on the Long Range Arena (LRA) benchmark. Hopefully they haven't excluded categories they lose on, or stronger competing models.
Winning against their own baselines, not against the current best-performing model, which apparently is S5 (https://paperswithcode.com/sota/long-range-modeling-on-lra) at 87.46 overall vs. 58.31 here.