Comment by yorwba
2 months ago
I don't see how you could fit causal masking into this framework without having to do n different FFTs, and there's no mention of positional embeddings either, so I guess the self-attention implementation being compared against is noncausal NoPE, which would make this a case of baseline sandbagging and maybe not so impressive.
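To make the causal-masking point concrete, here is a minimal sketch (my own illustration, not the paper's code) assuming an FNet-style mix that takes the real part of an FFT along the sequence dimension: the global transform leaks information from future tokens, and the naive fix is one FFT per prefix, i.e. n FFTs.

```python
import numpy as np

def fft_mix(x):
    # Global mixing over the whole sequence (noncausal by construction).
    return np.fft.fft(x, axis=0).real

def causal_fft_mix(x):
    # Naive causal variant: position t may only see x[:t+1],
    # which forces a separate FFT for every prefix -> n FFTs total.
    n = x.shape[0]
    return np.stack([np.fft.fft(x[: t + 1], axis=0).real[-1] for t in range(n)])

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))  # (sequence length, model dim)

# Perturb only the last token; with global FFT mixing even the first
# output position changes, i.e. information flows from the future.
x_future = x.copy()
x_future[-1] += 1.0
print(np.allclose(fft_mix(x)[0], fft_mix(x_future)[0]))                # False -> noncausal
print(np.allclose(causal_fft_mix(x)[0], causal_fft_mix(x_future)[0]))  # True -> causal, at the cost of n FFTs
```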
If the results were close to state-of-the-art, probably the author would've mentioned it?
They do show their model winning every category on the Long Range Arena (LRA) benchmark. Hopefully they haven't excluded categories they lose on, or stronger competing models.
Winning against their own baselines, not against the current best-performing model, which apparently is S5 (https://paperswithcode.com/sota/long-range-modeling-on-lra) at 87.46 overall vs. 58.31 here.