Comment by snickmy

2 months ago

This is a great way to put it. That said, it was not obvious to me that the attention space (as it is structured in LLMs) is a frequency domain.

I wrote down the following in our internal Slack chat on 01.06.2025 — though, of course, actually carrying out the effort is much more work than writing it down.

Large language models (LLMs) operate in a high-dimensional token space, where tokens (words, subwords, or characters) can be viewed as discrete signals over a multi-dimensional knowledge space. FFT methods can then map these token signals from the time domain into a frequency-domain representation, with the aim of reducing computational complexity. This transformation lets us analyze token dynamics — frequency of occurrence, temporal correlations, and interactions across contexts — efficiently: embeddings are treated as signals, and their relationships in sequence show up as patterns in the frequency domain.

An FFT could decompose token streams into dominant frequency components, revealing periodic or recurrent patterns in language usage. These patterns repeat across human-generated knowledge and generally follow a predefined set of rules, so the signals are not just white noise; they are predictable. By emphasizing high-energy components in the spectrum, next-token prediction can suppress noise and focus on statistically probable outcomes.

This method could reduce computational overhead during training and inference by enabling lightweight spectral analysis instead of heavy attention mechanisms, especially for long-context or repetitive sequences. Classical signal-filtering techniques (low-pass, high-pass, band-pass) could also help align model behavior with human linguistic patterns, refine token embeddings, and improve efficiency in both training and inference phases.
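A minimal NumPy sketch of the filtering idea, with entirely synthetic data standing in for learned embeddings (the sequence, dimensions, and cutoff are all assumptions for illustration): treat an embedding sequence as a multi-channel signal, take an FFT along the sequence axis, zero out high-frequency components, and invert.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a token embedding sequence: seq_len x dim.
# A real model would supply learned embeddings here.
seq_len, dim = 64, 8
t = np.arange(seq_len)
# A slow "idea-scale" oscillation plus token-level noise.
slow = np.sin(2 * np.pi * t / 32)[:, None] * rng.standard_normal((1, dim))
embeddings = slow + 0.3 * rng.standard_normal((seq_len, dim))

# FFT along the sequence axis: each embedding dimension becomes a spectrum.
spectrum = np.fft.rfft(embeddings, axis=0)

# Low-pass filter: keep only the lowest-frequency bins
# (the long-wavelength structure) and zero the rest.
cutoff = 4
filtered = spectrum.copy()
filtered[cutoff:] = 0
smoothed = np.fft.irfft(filtered, n=seq_len, axis=0)

# The smoothed sequence retains the slow component with the
# token-level noise largely stripped out.
residual = embeddings - smoothed
print(np.linalg.norm(smoothed), np.linalg.norm(residual))
```

Swapping the mask for a band-pass or high-pass shape is a one-line change, which is what makes the classical-filtering analogy attractive.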

A cartoon:

To form a coherent idea you need to coordinate a lot of tokens. In other words, ideas are long-distance correlations between tokens. Ideas are the long-wavelength features of streams of tokens.

Is it exactly right? No. But as a cartoon it can motivate exploring an idea like this.

Exactly. Exploiting the structure of the matrix (e.g., it is well approximated by a circulant matrix) is natural if there is structure to exploit. If everything in the preprint holds up, that might suggest some symmetries (e.g., approximate stationarity in time) in the data at hand.
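To make the circulant point concrete: circulant matrices are diagonalized by the DFT, so a circulant matrix-vector product — naively O(n²) — reduces to elementwise multiplication in the frequency domain at O(n log n). A small NumPy check (toy sizes; nothing here is taken from the preprint):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
c = rng.standard_normal(n)  # first column of the circulant matrix
x = rng.standard_normal(n)

# Dense circulant matrix: column j is c cyclically shifted down by j,
# so C[i, j] = c[(i - j) mod n].
C = np.column_stack([np.roll(c, j) for j in range(n)])

# Direct O(n^2) matrix-vector product.
direct = C @ x

# FFT route: Cx is the circular convolution of c and x, so it is an
# elementwise product in the frequency domain -- the structure exploited.
via_fft = np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)).real

print(np.max(np.abs(direct - via_fft)))  # agrees to machine precision
```

If an attention-like mixing matrix is well approximated by a circulant — i.e., the mixing is approximately stationary in position — this is exactly the kind of shortcut that becomes available.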