Comment by markisus
2 months ago
The Fourier transform is taken along the "token" dimension. However, in many applications the ordering along this dimension is not meaningful. That's why transformers are a great option for consuming permutation-invariant data.
I would like to see additional experiments using the lesser-known Fourier transform over finite groups [1], which is permutation invariant but shares many properties with the standard Fourier transform.
I also wonder, if this becomes the next big thing for LLMs, how easy it will be for inference engines (e.g. vLLM, llama.cpp) to integrate.
[1] https://en.wikipedia.org/wiki/Fourier_transform_on_finite_gr...
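To make the order-sensitivity concrete, here's a quick numpy sketch (my own, assuming FNet-style mixing: FFT along the token axis, keep the real part) showing that permuting the tokens does not just permute the output:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 8, 4
x = rng.standard_normal((seq_len, d_model))  # rows = tokens

def token_mix(x):
    # FFT along the token axis, keep the real part (FNet-style mixing).
    return np.fft.fft(x, axis=0).real

perm = rng.permutation(seq_len)

# Permuting the tokens first is NOT the same as permuting the output:
print(np.allclose(token_mix(x)[perm], token_mix(x[perm])))  # False
```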
Not an expert in this space.
Aren't tokens transformed with position-dependent information in most models?
I believe Llama applies a rotation to the vector based on the position in the input.
That's true in the realm of LLMs. But even in that case, the position information is added only at the first layer. Tokens in later layers can choose to "forget" this information. In addition, there are applications of transformers in other domains; see https://github.com/cvg/LightGlue or https://facebookresearch.github.io/3detr/
Transformers like Llama use rotary embeddings, which are applied in every single attention layer:
https://github.com/huggingface/transformers/blob/222505c7e4d...
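For the curious, a stripped-down sketch of what those rotary embeddings do (my own simplification, not the HF code, which also handles batching, heads, and caching): adjacent dimension pairs of each query/key vector get rotated by a position-dependent angle, inside every attention layer.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate adjacent dimension pairs of x by position-dependent angles.

    x: (seq_len, dim) queries or keys, dim even; pos: (seq_len,) positions.
    Interleaved-pair form from the RoPE paper; the HF code linked above
    uses an equivalent half-split ("rotate_half") layout.
    """
    dim = x.shape[-1]
    freqs = base ** (-np.arange(0, dim, 2) / dim)  # one frequency per pair
    angles = pos[:, None] * freqs[None, :]         # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin             # 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Applied to q and k inside every attention layer, e.g.:
# q, k = rope(q, np.arange(seq_len)), rope(k, np.arange(seq_len))
```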
What's the finite group in this case?
I'm thinking the integers mod 2^n, where n is something computers are good at (8, 32, 64). You'd have hardware support for the group operation.
That is the traditional Fourier transform, except it can be a cyclic group of any size; it doesn't need to be a power of 2. (Though FFTs with 2^n input size are particularly easy to implement.)
And it's not permutation invariant.
You mean for the group operation to be standard modular addition? In that case (as a sibling comment says) you'll recover the classic discrete Fourier transform.
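And it's easy to check: building the transform from the characters of Z_N reproduces numpy's FFT exactly (a quick sketch of my own, using numpy's sign convention):

```python
import numpy as np

N = 6  # any cyclic group Z_N works, not just powers of 2

# Characters of Z_N: chi_k(n) = exp(-2*pi*i*n*k/N).
# Stacked as rows they form exactly the classic DFT matrix.
n = np.arange(N)
F = np.exp(-2j * np.pi * np.outer(n, n) / N)

x = np.random.default_rng(0).standard_normal(N)
print(np.allclose(F @ x, np.fft.fft(x)))  # True
```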