Comment by xeonmc

2 months ago

Basically leverages the convolution theorem[0]: expensive convolutions in direct space become simple multiplications in reciprocal space, and vice versa.

Wherever you have a convolution operation on your data, transform the data to the conjugate domain to turn the convolution into a multiplication.

In other words, work in the domain that is natural to your data.

[0] https://en.wikipedia.org/wiki/Convolution_theorem

This is a great way to put it. That said, it was not obvious to me that the attention space (as it is structured in LLMs) is a frequency domain.

  • I wrote down the following in the internal Slack chat on 01.06.2025, but of course, actually carrying out the effort is much more work than writing it down.

    Large language models (LLMs) operate in a high-dimensional token space, where tokens (words, subwords, or characters) can be viewed as discrete signals covering a multi-dimensional knowledge space. FFT-based analysis can map these token signals into the frequency domain, with the goal of reducing computational complexity. This transformation lets us analyze token dynamics, such as frequency of occurrence, temporal correlations, and interactions across contexts, efficiently: embeddings are treated as signals, and their relationships in a sequence are captured as patterns in the frequency domain.

    An FFT could decompose token streams into dominant frequency components, revealing periodic or recurrent patterns in language usage. These patterns repeat across human-generated knowledge and generally follow a predefined set of rules, so the signals are not just white noise; they are predictable. Predictions of the next token can then be made by emphasizing high-energy components in the frequency spectrum, reducing noise and focusing on statistically probable outcomes.

    This could reduce computational overhead during training and inference by enabling lightweight spectral analysis rather than heavy attention mechanisms, especially for long-context or repetitive sequences. Classical signal-filtering techniques (low-pass, high-pass, band-pass) could also help align model behavior with human linguistic patterns, refine token embeddings, and improve efficiency in both training and inference.
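
    As a concrete (and heavily simplified) sketch of the idea, not the paper's actual method: mixing a token sequence with an FFT touches every position at O(n log n) cost, in the spirit of FNet-style token mixing. The array shapes and the keep-only-the-real-part choice here are illustrative assumptions.

      import numpy as np

      def fft_token_mixing(embeddings: np.ndarray) -> np.ndarray:
          # 2-D FFT over the sequence and feature axes: every output position
          # now depends on every input position, at O(n log n) cost instead of
          # the O(n^2) cost of full pairwise attention.
          mixed = np.fft.fft2(embeddings)
          # Keep only the real part so downstream layers stay real-valued
          # (an illustrative choice, not a claim about the preprint).
          return np.real(mixed)

      tokens = np.random.randn(128, 64)   # a hypothetical 128-token, 64-dim sequence
      mixed = fft_token_mixing(tokens)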

  • A cartoon:

    To form a coherent idea you need to coordinate a lot of tokens. In other words, ideas are long-distance correlations between tokens. Ideas are the long-wavelength features of streams of tokens.

    Is it exactly right? No. But as a cartoon it can motivate exploring an idea like this.

  • Exactly. Exploiting the structure of the matrix (e.g., it is well approximated by a circulant matrix) is natural if there is structure to exploit. If everything in the preprint holds up, that might suggest some symmetries (e.g., approximate stationarity in time) in the data at hand.
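
    For anyone who wants to see the circulant case concretely, here is a minimal NumPy/SciPy check (with a made-up random first column, nothing from the preprint): multiplying by a circulant matrix is just elementwise multiplication in Fourier space.

      import numpy as np
      from scipy.linalg import circulant

      c = np.random.randn(256)   # first column of the circulant matrix
      x = np.random.randn(256)

      direct = circulant(c) @ x                                    # O(n^2) dense matvec
      spectral = np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)).real   # O(n log n) via the DFT

      assert np.allclose(direct, spectral)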

Yeah, the basic math space-transformation sandwich: 1) transform the data into another space, 2) operate in that space, 3) transform back into the original space. To optimize this, optimize each step and do as much of the work as possible in the most efficient space.
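
A tiny sketch of that sandwich, using differentiation as the operation (my own illustrative example, not anything from the article): transform with an FFT, multiply by i*k, transform back.

  import numpy as np

  n = 256
  x = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
  f = np.sin(3.0 * x)

  # 1) into Fourier space, 2) differentiate by multiplying by i*k, 3) back again.
  k = 2.0 * np.pi * np.fft.fftfreq(n, d=x[1] - x[0])   # angular wavenumbers
  df = np.fft.ifft(1j * k * np.fft.fft(f)).real

  assert np.allclose(df, 3.0 * np.cos(3.0 * x))        # matches the exact derivative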

> In other words, work in the domain that is natural to your data.

Why would multiplication be more "natural" to a domain than convolution, as opposed to just simpler to calculate?

  • As an example, if you're trying to "smooth out" some signal, what you're really trying to do is remove the high frequency components. So using a Fourier transform to convert it to the frequency domain lets you directly work with the frequency data, rather than indirectly in the time/space/whatever domain. The fact that the operation is simpler in the frequency domain is a good hint that you've picked the "natural" space in which to look at the problem. Of course, there's no formal definition of "naturalness," and at the end of the day, you get the same result either way.
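
    A minimal sketch of that smoothing example (the signal and cutoff are made up): go to the frequency domain, zero out the high-frequency bins, and come back.

      import numpy as np

      n = 512
      t = np.linspace(0.0, 1.0, n, endpoint=False)
      noisy = np.sin(2.0 * np.pi * 5.0 * t) + 0.3 * np.random.randn(n)  # 5 Hz tone + noise

      spectrum = np.fft.rfft(noisy)
      spectrum[20:] = 0.0                       # drop everything above the 20th bin
      smoothed = np.fft.irfft(spectrum, n=n)    # back to the time domain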

  • One way to think about things is in terms of diagonalization. A generic linear operator is a fairly complicated thing that mixes information from different dimensions. When you diagonalize an operator, it's the same operator, but you're choosing a coordinate system where it becomes clear that all it was really doing was stretching along each coordinate axis, so you've broken it down into something that acts independently in a simple way on each dimension. The Fourier transform is unitary, so the intuition is that you're basically doing something like a rigid transformation of space (e.g. a rotation), and when you look from the correct angle, you see that your original operator wasn't some complicated "add a stretched version of this dimension to this other dimension minus a stretched version of a third dimension", but just "stretch this by this, this by this, etc."

    On the other hand, convolution itself is already "just" multiplication. e.g. multiplying polynomials is convolution of their coefficients (to get the x^n coefficient, you need to add up all the combinations of a_i a_j x^i x^j where i+j=n; there's a small numerical check of this after the footnote), and this point of view also applies to e.g. linear time-invariant systems[0] by thinking of your function as the weights of an infinite weighted sum (so sort of an infinite polynomial) of time-shift operators (and this point of view works for other groups, not just time shifts). So f(t) is then the "coefficient" for the t-shift, and multiplying two such weighted sums again has you convolve the coefficients (so your original functions). The jargon way to say this is that your space of G-invariant functions is secretly the group algebra of G (G being a group). From that point of view, convolution is the "natural" multiplication on G-invariant functions. One can then ask whether there's a Fourier transform for other groups, which leads to abstract harmonic analysis. e.g. the Mellin transform is the Fourier transform for scale/stretch invariant functions as opposed to shift invariant.

    [0] The typical systems that one studies in signal processing contexts where convolution and Fourier transforms are your bread and butter: https://en.wikipedia.org/wiki/Linear_time-invariant_system
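
    A small numerical check of the polynomial point (the particular polynomials are arbitrary): multiplying them as polynomials, convolving their coefficients directly, and multiplying their spectra all agree.

      import numpy as np

      a = np.array([1.0, 2.0])          # 1 + 2x
      b = np.array([3.0, 4.0, 5.0])     # 3 + 4x + 5x^2

      via_poly = np.polynomial.polynomial.polymul(a, b)   # polynomial product
      via_conv = np.convolve(a, b)                        # convolution of coefficients

      # Convolution theorem: zero-pad to the output length, multiply spectra, invert.
      m = len(a) + len(b) - 1
      via_fft = np.fft.irfft(np.fft.rfft(a, m) * np.fft.rfft(b, m), m)

      assert np.allclose(via_poly, via_conv) and np.allclose(via_conv, via_fft)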

    • This is a good general tool for less-mathematically-deep folks to keep in their pocket: look for well-behaved objects that do something nice under the operation you're interested in. Typical "well-behaved" objects do things like stay where they are, or end up as a constant multiple of themselves, or end up as 0 or 1, or something like that. Then try to represent everything else in terms of those objects, so that you can take advantage of their convenient behavior. Less-difficult examples include (with a small sketch after the list):

      - Prime factorization: primes have nice properties, and you can turn every integer into a product of primes (polynomial factorization is an extension of this idea) and work with the nice prime properties

      - Vector spaces: basis vectors have nice properties, so you write vectors as sums of them and do operations on the coefficients instead of the vectors themselves

      - The exponential function: it's the unique function (up to scaling) with f'(x) = f(x), so you try to turn everything else into exponentials any time you have to solve some painful differential equation, because you know those terms will go away

      - Fixed points in dynamical systems: if you don't want to analyze how arbitrary things change, find the points that don't, then think of the other points as (fixed point) + (small perturbation) and reduce your work to handling the perturbation

      - Taylor series: polynomials are easy, smooth functions are hard, so turn your smooth function into a polynomial and do polynomial things with it
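
      To tie this back to the thread, a small sketch (the kernel is random and the frequency arbitrary): complex exponentials are exactly these well-behaved objects for convolution, since circularly convolving one with any kernel just scales it by the kernel's DFT coefficient at that frequency.

        import numpy as np

        n = 64
        kernel = np.random.randn(n)
        freq = 5
        mode = np.exp(2j * np.pi * freq * np.arange(n) / n)   # a single Fourier mode

        # Circular convolution of the kernel with the mode, computed directly.
        convolved = np.array([sum(kernel[j] * mode[(i - j) % n] for j in range(n))
                              for i in range(n)])

        # The mode comes back unchanged except for a scale factor: the kernel's
        # DFT coefficient at that frequency (i.e. it's an eigenvector).
        eigenvalue = np.fft.fft(kernel)[freq]
        assert np.allclose(convolved, eigenvalue * mode)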

  • > just simpler to calculate

    That's all that "natural" means in this context. It's like "elegant": if it takes less effort to get the result you need, why wouldn't you take the easier route?

Is reciprocal space always just 1/space as in frequency=1/t?

  • In the case of FFTs, no, which is why I prefer the term Fourier space. I don't like "frequency domain" because I frequently work with 3-D and 5-D FFTs, and I've always felt that "frequency" is tied to the one-dimensional FFT.

    • Maybe something like "recurrence space" would be better. Frequency does have a physical interpretation that could be misconstrued; e.g., the FFT of a wave in the space domain yields wavenumber as the independent variable, not frequency.

  • It's called reciprocal because the Fourier transform is essentially its own inverse (applying it twice just reflects the input), and the input and output spaces have the same 'shape' (functions from the reals to the complex numbers).

    So they are considered two sides of the same coin, and "reciprocal" in that sense.

Yeah, but the savings are theoretical. You turn an O(n^2) operation into O(n log n). Sounds great until you realize that n is three on average. To boot, you have to use complex numbers for the calculations, which are also less numerically stable. So, to the best of my knowledge, the FFT is not a win for ordinary convolution.

Maybe for self-attention and for their use cases n is much larger; I didn't read the article. But you still have to deal with complex numbers.
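
For what it's worth, the crossover is easy to probe empirically. A rough sketch (array sizes and timing setup are arbitrary) comparing direct and FFT-based convolution with SciPy:

  import numpy as np
  from scipy import signal
  from timeit import timeit

  for n in (3, 64, 1024, 16384):
      a = np.random.randn(n)
      b = np.random.randn(n)
      t_direct = timeit(lambda: signal.convolve(a, b, method="direct"), number=20)
      t_fft = timeit(lambda: signal.convolve(a, b, method="fft"), number=20)
      print(f"n={n:6d}  direct={t_direct:.4f}s  fft={t_fft:.4f}s")

For tiny kernels the direct method usually wins (which is why SciPy's default method="auto" chooses based on size); the FFT route only starts paying off at the long lengths the attention-replacement argument cares about.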

  • > You turn an O(n^2) operation into O(n log n). Sounds great until you realize that n is three on average.

    Sure, but are long convolutions avoided precisely because they're expensive? This paper is talking about an alternative to an attention mechanism, which covers the entire context window, no? Isn't this paper saying: you could use a long convolution for this instead, and long convolutions don't have to be slow?

    > you have to use complex numbers for the calculations, which are also less numerically stable

    I haven't heard of numerical stability being a big deal in neural nets; in fact, don't people often use 16-bit floats for weights to save space? Does the numerical error from complex arithmetic even exceed the precision already given up by quantization? Are complex numbers really inherently less numerically stable, or are we just not as good at using them yet?

  • 3^2 / (3*log(3)) = >6x performance improvement and, if three is a linear average, I'd expect the average improvement to be even higher. I know real world computation doesn't answer to the simple scaling equations and there may well be a constant factor >6 that eats the gains, but I don't think the two Big Os and an n=3 are sufficient to make your case.

    • That's not how O(f(n)) notation works. You can't just plug in an n to O(f(n)) / O(g(n)) and claim a specific performance improvement. You have to actually know all the factors that are stripped off by big-O to do that, and you never really do. For instance, you're ignoring the cost to transform between domains.

      > I know real world computation doesn't answer to the simple scaling equations ... but

      No, no "but". This defeats the entire claim, and you can't just "but" it back.

      Also, you appear to have used base-10 log for log(3). It's almost certain that base-2 is more appropriate, leading to a factor of about 1.9x, not 6x. But of course log_1000(n) and log_2(n) have the same Big-O, which is why the base is left off, so you really just cannot say anything specific at all. O(n^2) might be faster than O(n*log(n)) up to n = Graham's number.

    • FYI you’re using log(3) to the base 10 there. That’s less than one so your figures look artificially good.

  • It's true that it isn't perfectly numerically stable and that an FFT round trip isn't exactly reversible in floating point. To get an idea of how frequency-domain data relates to attention, I think it helps to look at JPEG; images tend to make the concept easier to understand. JPEG uses a cosine transform instead of a Fourier transform, but the principle is the same.
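
    A tiny sketch of the JPEG intuition (using SciPy's DCT on a made-up 8x8 block, not JPEG's full pipeline): most of the block's energy lands in a few low-frequency coefficients, so keeping only those still reconstructs the block reasonably well.

      import numpy as np
      from scipy.fft import dctn, idctn

      # A made-up smooth 8x8 "image block" (a gentle gradient).
      block = np.add.outer(np.arange(8.0), np.arange(8.0))

      coeffs = dctn(block, norm="ortho")      # 2-D DCT, like JPEG's per-block transform
      mask = np.zeros_like(coeffs)
      mask[:3, :3] = 1.0                      # keep only the 3x3 lowest-frequency corner
      approx = idctn(coeffs * mask, norm="ortho")

      # Modest reconstruction error despite keeping only 9 of 64 coefficients.
      print(np.abs(block - approx).max())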