
Comment by bjourne

2 months ago

Yeah, but the savings are theoretical. You turn an O(n^2) operation into an O(n log n) one. Sounds great until you realize that n is three on average. To boot, you have to do the calculations with complex numbers, which are also less numerically stable. So, to the best of my knowledge, FFT is not a win for ordinary convolution.

Maybe for self-attention and for their use cases n is much larger; I didn't read the article. But you still have to deal with complex numbers.
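For concreteness, here is a minimal NumPy sketch (mine, not from the article) of the two approaches being compared. The `rfft`/`irfft` pair keeps the complex intermediates out of sight for real-valued inputs, and for a length-3 kernel the direct route is typically the faster one in practice:

```python
# Minimal sketch (not from the article): direct vs FFT-based 1-D convolution.
import numpy as np

rng = np.random.default_rng(0)

def direct_conv(x, k):
    # O(n*m): sliding dot products of the kernel over the signal.
    return np.convolve(x, k, mode="full")

def fft_conv(x, k):
    # O(n log n): pointwise multiply in the frequency domain, then invert.
    n = len(x) + len(k) - 1
    X = np.fft.rfft(x, n)          # the complex intermediates live here
    K = np.fft.rfft(k, n)
    return np.fft.irfft(X * K, n)

x = rng.standard_normal(4096)
k = rng.standard_normal(3)         # the "n is three on average" case
print(np.max(np.abs(direct_conv(x, k) - fft_conv(x, k))))  # small, but nonzero
```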

> You turn an O(n^2) operation into an O(n log n) one. Sounds great until you realize that n is three on average.

Sure, but aren't long convolutions avoided precisely because they're expensive? This paper is talking about an alternative to the attention mechanism, which covers the entire context window, no? Isn't the paper saying: you could use a long convolution for this instead, and long convolutions don't have to be slow?

> you have to use complex numbers for calculations which are also less numerically stable

I haven't heard of numerical stability being a big deal in neural nets; in fact, don't people often use 16-bit floats as weights to save on space? Is the precision lost to complex arithmetic even noticeable next to the precision already thrown away by quantization? Are complex numbers really inherently less numerically stable, or are we just not as good at using them yet?
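As a rough sanity check on that question (my own sketch, with ballpark figures I'd expect rather than anything measured in the paper): the error an FFT round trip introduces in single precision is orders of magnitude below what an fp16 cast already throws away.

```python
# Rough sketch: FFT round-trip error in float32 vs plain fp16 rounding error.
import numpy as np
from scipy import fft  # scipy.fft keeps float32 inputs in single precision

rng = np.random.default_rng(1)
x = rng.standard_normal(8192).astype(np.float32)

# Forward and inverse real FFT, with complex64 intermediates.
roundtrip = fft.irfft(fft.rfft(x), len(x))
fft_err = np.max(np.abs(roundtrip - x))

# Error from merely storing the same values in half precision.
quant_err = np.max(np.abs(x.astype(np.float16).astype(np.float32) - x))

print(f"FFT round trip (float32): {fft_err:.1e}")   # roughly 1e-6
print(f"fp16 rounding:            {quant_err:.1e}") # roughly 1e-3
```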

3^2 / (3*log(3)) gives a >6x performance improvement and, if three is a linear average, I'd expect the average improvement to be even higher. I know real-world computation doesn't answer to the simple scaling equations and there may well be a constant factor >6 that eats the gains, but I don't think two Big-Os and an n=3 are sufficient to make your case.

  • That's not how O(f(n)) notation works. You can't just plug an n into O(f(n)) / O(g(n)) and claim a specific performance improvement. You'd have to actually know all the factors that are stripped off by big-O to do that, and you never really do. For instance, you're ignoring the cost of transforming between domains.

    > I know real world computation doesn't answer to the simple scaling equations ... but

    No, no "but". This defeats the entire claim, and you can't just "but" it back.

    Also, you appear to have used a base-10 log for log(3). It's almost certain that base-2 is more appropriate, which brings the figure down to about 1.9x, not 6x (see the quick check after this sub-thread). But of course log_1000(n) and log_2(n) have the same big-O, which is why the base is left off, so you really just cannot say anything specific at all. O(n^2) might be faster than O(n*log(n)) up to n = Graham's number.

    • > This defeats the entire claim, and you can't just "but" it back.

      You may have missed what the "but" is doing: it's agreeing with you. My entire claim is defeated, and it uses the same reasoning that the parent used to make their claim. I'm not attempting to show that there is an improvement, only that the lack of improvement has not been demonstrated by listing two Big-Os and setting n.

      But yes, the log base 10 is my bad.

  • FYI you’re using log(3) to the base 10 there. That’s less than one so your figures look artificially good.
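For what it's worth, here is the quick arithmetic check referenced in the sub-thread above (my own numbers; the constants that big-O hides can of course still swamp either figure):

```python
# The naive n^2 / (n log n) ratio at n = 3, with both log bases from the thread.
import math

n = 3
print(n**2 / (n * math.log10(n)))  # ~6.3: the base-10 figure quoted above
print(n**2 / (n * math.log2(n)))   # ~1.9: with base-2, as the replies point out
```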

It's true that it isn't perfectly numerically stable and that an FFT round trip isn't exactly reversible in floating point. I think a good way to get an idea of how frequency data relates to attention is to look at JPEG; images tend to make the concept easier to understand. JPEG uses a cosine transform instead of a Fourier transform, but the principle is the same.
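A small sketch of that JPEG-style intuition (mine, not from the article), using SciPy's DCT on a synthetic 8x8 block: transform, quantize, invert, and you get the block back only approximately.

```python
# Sketch of the JPEG-style round trip: DCT -> quantize -> inverse DCT.
import numpy as np
from scipy.fft import dctn, idctn

# A smooth, image-like 8x8 block (a real JPEG block would come from a photo).
xx, yy = np.meshgrid(np.arange(8), np.arange(8))
block = 128 + 40 * np.sin(xx / 3) + 30 * np.cos(yy / 4)

coeffs = dctn(block, norm="ortho")          # pixel space -> frequency space
quantized = np.round(coeffs / 16) * 16      # crude stand-in for JPEG's quantization table
restored = idctn(quantized, norm="ortho")   # frequency space -> pixel space

print(np.count_nonzero(quantized))          # most high-frequency coefficients drop to zero
print(np.max(np.abs(restored - block)))     # close to the original, but not exact
```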