Comment by yorwba

2 months ago

I don't see how you could fit causal masking into this framework without doing n different FFTs, and there's no mention of positional embeddings either. So I guess the self-attention baseline being compared against is noncausal NoPE (no positional embeddings), which would make this a case of baseline sandbagging and maybe not so impressive.
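
To make the causality point concrete, here is a minimal NumPy sketch. It assumes the token-mixing step is a plain FFT over the sequence axis, which is a guess and not necessarily what the article does; `fft_mix` and `prefix_fft_mix` are hypothetical names made up for illustration. It perturbs only the last token and checks whether earlier outputs change.

```python
import numpy as np

def fft_mix(x):
    # Assumed mixing step: FFT along the sequence axis, keep the real part.
    # x has shape (seq_len, dim).
    return np.fft.fft(x, axis=0).real

def prefix_fft_mix(x):
    # Naive causal variant: output t only sees x[:t+1],
    # so it needs one FFT per prefix, i.e. n FFTs in total.
    n = x.shape[0]
    return np.stack([np.fft.fft(x[: t + 1], axis=0).real[t] for t in range(n)])

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
y = x.copy()
y[-1] += 1.0  # perturb only the last token

# Largest change in the outputs at positions before the perturbed token.
leak_global = np.abs(fft_mix(x)[:-1] - fft_mix(y)[:-1]).max()
leak_prefix = np.abs(prefix_fft_mix(x)[:-1] - prefix_fft_mix(y)[:-1]).max()

print("global FFT leak:", leak_global)      # nonzero: future token affects the past
print("per-prefix FFT leak:", leak_prefix)  # zero: causal, but n FFTs
```

Under these assumptions the single global FFT lets the last token influence every earlier position, while the per-prefix version stays causal at the cost of n separate FFTs (roughly O(n^2 log n) work in total).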

If the results were close to state-of-the-art, the author would probably have mentioned it?