Comment by snickmy

2 months ago

This is a great way to put it. That said, it was not obvious to me that the attention space (as it is structured in LLMs) is a frequency domain.

I wrote down the following in our internal Slack chat on 01.06.2025 — though, of course, actually carrying out the effort is much more work than writing it down.

Large language models (LLMs) operate in a high-dimensional token space, where tokens (words, subwords, or characters) can be viewed as discrete signals over a multi-dimensional knowledge space. FFT methods can then map these token signals from the time domain into a frequency-domain representation, with the aim of reducing computational complexity. This transformation lets us analyze token dynamics — frequency of occurrence, temporal correlations, and interactions across contexts — efficiently: embeddings are treated as signals, and their relationships in sequence show up as patterns in the frequency domain.

An FFT could decompose token streams into dominant frequency components, revealing periodic or recurrent patterns in language usage. These patterns repeat across human-generated knowledge and generally follow a predefined set of rules, so the signals are not just white noise; they are predictable. By emphasizing high-energy components in the spectrum, next-token prediction can suppress noise and focus on statistically probable outcomes.

This method could reduce computational overhead during training and inference by enabling lightweight spectral analysis instead of heavy attention mechanisms, especially for long-context or repetitive sequences. Classical signal-filtering techniques (low-pass, high-pass, band-pass) could also help align model behavior with human linguistic patterns, refine token embeddings, and improve efficiency in both training and inference phases.
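A minimal NumPy sketch of the filtering idea, with entirely synthetic data standing in for learned embeddings (the sequence, dimensions, and cutoff are all assumptions for illustration): treat an embedding sequence as a multi-channel signal, take an FFT along the sequence axis, zero out high-frequency components, and invert.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a token embedding sequence: seq_len x dim.
# A real model would supply learned embeddings here.
seq_len, dim = 64, 8
t = np.arange(seq_len)
# A slow "idea-scale" oscillation plus token-level noise.
slow = np.sin(2 * np.pi * t / 32)[:, None] * rng.standard_normal((1, dim))
embeddings = slow + 0.3 * rng.standard_normal((seq_len, dim))

# FFT along the sequence axis: each embedding dimension becomes a spectrum.
spectrum = np.fft.rfft(embeddings, axis=0)

# Low-pass filter: keep only the lowest-frequency bins
# (the long-wavelength structure) and zero the rest.
cutoff = 4
filtered = spectrum.copy()
filtered[cutoff:] = 0
smoothed = np.fft.irfft(filtered, n=seq_len, axis=0)

# The smoothed sequence retains the slow component with the
# token-level noise largely stripped out.
residual = embeddings - smoothed
print(np.linalg.norm(smoothed), np.linalg.norm(residual))
```

Swapping the mask for a band-pass or high-pass shape is a one-line change, which is what makes the classical-filtering analogy attractive.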

A cartoon:

To form a coherent idea you need to coordinate a lot of tokens. In other words, ideas are long-distance correlations between tokens. Ideas are the long-wavelength features of streams of tokens.

Is it exactly right? No. But as a cartoon it can motivate exploring an idea like this.

Exactly. Exploiting the structure of the matrix (e.g., it is well approximated by a circulant matrix) is natural if there is structure to exploit. If everything in the preprint holds up, that might suggest some symmetries (e.g., approximate stationarity in time) in the data at hand.
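To make the circulant point concrete: circulant matrices are diagonalized by the DFT, so a circulant matrix-vector product — naively O(n²) — reduces to elementwise multiplication in the frequency domain at O(n log n). A small NumPy check (toy sizes; nothing here is taken from the preprint):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
c = rng.standard_normal(n)  # first column of the circulant matrix
x = rng.standard_normal(n)

# Dense circulant matrix: column j is c cyclically shifted down by j,
# so C[i, j] = c[(i - j) mod n].
C = np.column_stack([np.roll(c, j) for j in range(n)])

# Direct O(n^2) matrix-vector product.
direct = C @ x

# FFT route: Cx is the circular convolution of c and x, so it is an
# elementwise product in the frequency domain -- the structure exploited.
via_fft = np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)).real

print(np.max(np.abs(direct - via_fft)))  # agrees to machine precision
```

If an attention-like mixing matrix is well approximated by a circulant — i.e., the mixing is approximately stationary in position — this is exactly the kind of shortcut that becomes available.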