Comment by jampekka
2 months ago
The actual mechanism at least is quite simple. Essentially it takes the FFT of the input embeddings, multiplies it elementwise by weights obtained from the input embeddings via an MLP (plus a constant but learnable bias), runs the result through an activation function, and finally takes the inverse FFT.
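Roughly, in PyTorch (this is just a sketch of my reading; the mean-pooling for the weights, how the activation handles complex values, and the layer sizes are guesses, not from the paper):

```python
import torch
import torch.nn as nn

class SpectralMixing(nn.Module):
    # Sketch of the described mechanism; not the paper's code.
    # x: (batch, seq_len, d_model) real-valued token embeddings.
    def __init__(self, d_model: int, seq_len: int):
        super().__init__()
        n_freq = seq_len // 2 + 1                 # rfft output length along the sequence axis
        self.mlp = nn.Sequential(                 # filter weights derived from the embeddings
            nn.Linear(d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, n_freq),
        )
        self.bias = nn.Parameter(torch.zeros(n_freq))  # constant but learnable bias on the filter
        self.act = nn.GELU()

    def forward(self, x):
        X = torch.fft.rfft(x, dim=1)                    # FFT over the token/sequence dimension
        w = self.mlp(x.mean(dim=1)) + self.bias         # weights from the input (mean-pooled here; a guess)
        X = X * w.unsqueeze(-1)                         # elementwise multiply in the frequency domain
        X = self.act(X.real) + 1j * self.act(X.imag)    # activation (complex handling is a guess)
        return torch.fft.irfft(X, n=x.size(1), dim=1)   # inverse FFT back to the token domain

layer = SpectralMixing(d_model=64, seq_len=128)
out = layer(torch.randn(2, 128, 64))                    # shape (2, 128, 64)
```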
The "frequencies" are probably something quite abstract. FFT is often used in ways where there aren't really clear frequency interpretation. The use is due to convenient mathematical properties (e.g. the convolution theorem).
Rather amazing if this really works well. Very elegant.
Essentially leveraging the convolution theorem. The same philosophy pops up in many places, e.g. DFT calculations.
Yep, it's very common to use this. We used it for grid-based pairwise interactions since it turns an N^2 op into N log N.
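A toy version of the trick (not our actual code, just the convolution theorem in numpy: pointwise multiplication of the FFTs equals circular convolution, in N log N instead of N^2):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8
a, b = rng.normal(size=N), rng.normal(size=N)

# Direct circular convolution, O(N^2)
direct = np.array([sum(a[m] * b[(n - m) % N] for m in range(N)) for n in range(N)])

# Via FFT, O(N log N): multiply in frequency space, then transform back
via_fft = np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)).real

assert np.allclose(direct, via_fft)
```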
That sounds just like Particle Mesh Ewald, which we use in molecular dynamics to approximate the forces of pairwise interactions (interpolated on a grid). https://en.wikipedia.org/wiki/P3M
Sorry, added the convolution theorem part in an edit after your comment.
I’m still confused. Does it treat the input tokens as a sampled waveform?
I mean, say I have some text file in ASCII. Do I then just pretend it's a raw WAV and do an FFT on it? I guess that can give me some useful information (like whether it looks like a particular natural language or just random; something similar is sometimes used in cryptanalysis of simple substitution ciphers). It feels surprising that the inverse FFT can produce coherent output after fiddling with the distribution.
Do keep in mind that the FFT is a lossless, equivalent representation of the original data.
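For instance, the round trip recovers the original signal up to floating-point error:

```python
import numpy as np

# FFT followed by inverse FFT is the identity (up to numerical precision),
# so no information is lost by working in the frequency domain.
x = np.random.default_rng(1).normal(size=16)
x_roundtrip = np.fft.ifft(np.fft.fft(x)).real
assert np.allclose(x, x_roundtrip)
```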
As I understand it, the token embedding stream would be equivalent to multi-channel sampled waveforms. The model either needs to learn the embeddings by back-propagating through FFT and IFFT, or use some suitable tokenization scheme which the paper doesn't discuss (?).
It seems unlikely to work for language.
It first embeds them into vectors. The input is a real matrix with (context length) x (embedding size) dimensions.
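Something like this, shapes only (the vocab size, embedding size, and context length here are made up):

```python
import torch

vocab_size, d_model, context_len = 1000, 64, 128
emb = torch.nn.Embedding(vocab_size, d_model)
tokens = torch.randint(0, vocab_size, (context_len,))
x = emb(tokens)                # real matrix: (context length) x (embedding size) = 128 x 64
X = torch.fft.rfft(x, dim=0)   # FFT along the context/sequence axis -> 65 x 64, complex
print(x.shape, X.shape)
```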
No. The FFT is an operation on a discrete domain; it is not the continuous FT. In the same way audio waveforms are processed by an FFT, you bucket frequencies, which is conceptually a vector. Once you have a vector, you do machine learning like you would with any vector (except you do some FT in this case; I haven't read the paper).
Most likely the embedding of the token is passed through the FFT.