Comment by markisus
2 months ago
The Fourier transform is taken along the "token" dimension. However, in many applications the ordering along this dimension is not meaningful. That's why transformers are a great option for consuming permutation-invariant data.
I would like to see additional experiments using the lesser-known Fourier transform over finite groups [1], which is permutation invariant but shares many properties with the standard Fourier transform.
I also wonder, if this becomes the next big thing for LLMs, how easy it will be for inference engines (e.g. vLLM, llama.cpp) to integrate.
[1] https://en.wikipedia.org/wiki/Fourier_transform_on_finite_gr...
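To make the order-sensitivity concrete, here's a quick numpy sketch (my own, assuming FNet-style mixing: FFT along the token axis, keep the real part) showing that permuting the tokens does not just permute the output:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 8, 4
x = rng.standard_normal((seq_len, d_model))  # rows = tokens

def token_mix(x):
    # FFT along the token axis, keep the real part (FNet-style mixing).
    return np.fft.fft(x, axis=0).real

perm = rng.permutation(seq_len)

# Permuting the tokens first is NOT the same as permuting the output:
print(np.allclose(token_mix(x)[perm], token_mix(x[perm])))  # False
```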
Not an expert in this space.
Aren't tokens transformed with position-dependent information in most models?
I believe Llama applies a rotation to the vector based on the position in the input.
That's true in the realm of LLMs. But even in that case, the position information is added only at the first layer. Tokens in later layers can choose to "forget" this information. In addition, there are applications of transformers in other domains; see https://github.com/cvg/LightGlue or https://facebookresearch.github.io/3detr/
Transformers like Llama use rotary embeddings, which are applied in every single attention layer:
https://github.com/huggingface/transformers/blob/222505c7e4d...
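For the curious, a stripped-down sketch of what those rotary embeddings do (my own simplification, not the HF code, which also handles batching, heads, and caching): adjacent dimension pairs of each query/key vector get rotated by a position-dependent angle, inside every attention layer.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate adjacent dimension pairs of x by position-dependent angles.

    x: (seq_len, dim) queries or keys, dim even; pos: (seq_len,) positions.
    Interleaved-pair form from the RoPE paper; the HF code linked above
    uses an equivalent half-split ("rotate_half") layout.
    """
    dim = x.shape[-1]
    freqs = base ** (-np.arange(0, dim, 2) / dim)  # one frequency per pair
    angles = pos[:, None] * freqs[None, :]         # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin             # 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Applied to q and k inside every attention layer, e.g.:
# q, k = rope(q, np.arange(seq_len)), rope(k, np.arange(seq_len))
```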
What's the finite group in this case?
I'm thinking the integers mod 2^n, where n is something computers are good at (8, 32, 64). You'd have hardware support for the group operation.
That is the traditional Fourier transform, except it can be a cyclic group of any size; it doesn't need to be a power of 2. (Though FFTs with 2^n input size are particularly easy to implement.)
And it's not permutation invariant.
You mean for the group operation to be standard modular addition? In that case (as a sibling comment says) you'll recover the classic discrete Fourier transform.
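And it's easy to check: building the transform from the characters of Z_N reproduces numpy's FFT exactly (a quick sketch of my own, using numpy's sign convention):

```python
import numpy as np

N = 6  # any cyclic group Z_N works, not just powers of 2

# Characters of Z_N: chi_k(n) = exp(-2*pi*i*n*k/N).
# Stacked as rows they form exactly the classic DFT matrix.
n = np.arange(N)
F = np.exp(-2j * np.pi * np.outer(n, n) / N)

x = np.random.default_rng(0).standard_normal(N)
print(np.allclose(F @ x, np.fft.fft(x)))  # True
```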