Comment by korbip

7 days ago

There is a LOT of effort in the research community currently:

1. Improving the self-attention in the Transformer as is, keeping the quadratic complexity, which has some theoretical advantage in principle[1]. The most hyped one is probably DeepSeek's Multi-head Latent Attention[15], which is still attention at its core - but instead of caching full keys and values, it caches a small latent per token and reconstructs K and V from it.
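Very roughly, the idea looks like the following toy single-head NumPy sketch (my own variable names and sizes; it ignores the multi-head structure and the decoupled RoPE handling from the paper):

    import numpy as np

    rng = np.random.default_rng(0)
    T, d_model, d_latent, d_head = 128, 512, 64, 64  # toy sizes, d_latent << d_model

    W_dkv = rng.normal(size=(d_model, d_latent)) * 0.02  # shared KV down-projection
    W_uk = rng.normal(size=(d_latent, d_head)) * 0.02    # latent -> keys
    W_uv = rng.normal(size=(d_latent, d_head)) * 0.02    # latent -> values
    W_q = rng.normal(size=(d_model, d_head)) * 0.02

    X = rng.normal(size=(T, d_model))  # token representations
    C = X @ W_dkv                      # only this (T, d_latent) tensor needs caching
    K, V = C @ W_uk, C @ W_uv          # keys/values reconstructed on the fly
    Q = X @ W_q

    scores = Q @ K.T / np.sqrt(d_head)
    scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf  # causal mask
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)                           # softmax
    out = A @ V                        # standard attention from here on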

2. Linear RNNs: this line starts from Linear Attention[2] and continues with DeltaNet[3], RWKV[4], Retention[5], Gated Linear Attention[6], Mamba[7], Griffin[8], Based[9], xLSTM[10], TTT[11], Gated DeltaNet[12], and Titans[13].

They all share an update of the form C_t = F_t C_{t-1} + i_t k_t v_t^T, with a matrix-valued cell state C, and an output h_t = C_t^T q_t. A few tricks made these work, and they are now very strong competitors to Transformers. The key is the combination of a linear associative memory (aka Hopfield network, aka Fast Weight Programmer, aka state expansion...) with gating pushed along the sequence, similar to the original LSTM's input, forget, and output gates - except that here the gates depend only on the current input, not on the previous state, to keep the recurrence linear. The linearity is needed to make it parallelizable over the sequence; there are now efforts to add non-linearities back in, but let's see. Their main benefit and their main downside are the same thing: a fixed-size state, and with it linear (vs. the Transformer's quadratic) time complexity. A minimal sketch of the recurrence is below.
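To make that update concrete, here is a minimal NumPy sketch of the sequential form, with scalar gates for simplicity (the actual methods differ in how F_t and i_t are parameterized - scalar, per-dimension, or matrix-valued - but always computed from the current input only):

    import numpy as np

    def linear_rnn(Q, K, V, f, i):
        """Sequential scan: C_t = f_t * C_{t-1} + i_t * k_t v_t^T, h_t = C_t^T q_t.
        Q, K: (T, d_k); V: (T, d_v); f, i: (T,) scalar forget/input gates in [0, 1]."""
        T, d_k = K.shape
        d_v = V.shape[1]
        C = np.zeros((d_k, d_v))         # fixed-size matrix state, independent of T
        H = np.empty((T, d_v))
        for t in range(T):
            C = f[t] * C + i[t] * np.outer(K[t], V[t])  # write: outer-product memory
            H[t] = C.T @ Q[t]                           # read: query the state
        return H

    rng = np.random.default_rng(0)
    T, d_k, d_v = 64, 16, 16
    H = linear_rnn(rng.normal(size=(T, d_k)), rng.normal(size=(T, d_k)),
                   rng.normal(size=(T, d_v)),
                   1 / (1 + np.exp(-rng.normal(size=T))),   # sigmoid gates
                   1 / (1 + np.exp(-rng.normal(size=T))))

In training one does not actually run this loop step by step; the linearity is exactly what allows a scan or chunkwise-parallel formulation that maps well onto matmul hardware.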

At larger scales they have become popular in hybrids with Transformer (attention) blocks, since pure linear RNNs have problems with long-context tasks[14]. A cool thing is that they can also be distilled from pre-trained Transformers without too much performance drop[16]. A toy sketch of such a hybrid layer schedule is below.
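Schematically, the hybrid recipe just replaces a fraction of the recurrent blocks with attention blocks (the layer names and ratio here are made up for illustration, not any specific model's schedule):

    def hybrid_schedule(n_layers=24, attn_every=6):
        """Interleave full-attention blocks into a linear-RNN stack.
        Returns a layer-type list; the attention ratio is a hyperparameter."""
        return ["attention" if (idx + 1) % attn_every == 0 else "linear_rnn"
                for idx in range(n_layers)]

    print(hybrid_schedule())  # five linear-RNN blocks, then one attention block, repeated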

3. Along the sequence dimension, most things can be categorized into these two. Attention and linear (associative-memory-enhanced) RNNs are heavy users of matrix multiplications, and anything else would be a waste of FLOPs on current GPUs. The essence is how to store information and how to interact with it; there might still be interesting directions, as other comments show. Other important topics, which go into the depth / width of the model rather than the sequence dimension, are Mixture of Experts and iteration (RNNs) in depth[17]; a minimal MoE routing sketch is below.
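For completeness, Mixture of Experts in its common top-k form fits in a few lines (a generic sketch, not any particular model's router):

    import numpy as np

    def moe_forward(x, W_gate, experts, top_k=2):
        """Route one token to its top-k experts and mix their outputs
        with renormalized gate weights; the other experts stay idle."""
        logits = W_gate @ x
        chosen = np.argsort(logits)[-top_k:]             # highest-scoring experts
        w = np.exp(logits[chosen] - logits[chosen].max())
        w /= w.sum()                                     # softmax over chosen only
        return sum(wi * experts[e](x) for wi, e in zip(w, chosen))

    rng = np.random.default_rng(0)
    d = 8
    experts = [(lambda W: (lambda x: W @ x))(rng.normal(size=(d, d))) for _ in range(4)]
    y = moe_forward(rng.normal(size=d), rng.normal(size=(4, d)), experts)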

Disclaimer: I'm an author of xLSTM, and we recently released a 7B model[18] trained at NXAI, currently the fastest linear RNN at this scale and performance level. Happy to answer more questions on this or on the current state of this field of research.

[1] https://arxiv.org/abs/2008.02217

[2] https://arxiv.org/abs/2006.16236

[3] https://arxiv.org/abs/2102.11174

[4] https://github.com/BlinkDL/RWKV

[5] https://arxiv.org/abs/2307.08621

[6] https://arxiv.org/abs/2312.06635

[7] https://arxiv.org/abs/2312.00752

[8] https://arxiv.org/abs/2402.19427

[9] https://arxiv.org/abs/2402.18668

[10] https://arxiv.org/abs/2405.04517

[11] https://arxiv.org/abs/2407.04620

[12] https://arxiv.org/abs/2412.06464

[13] https://arxiv.org/abs/2501.00663

[14] https://arxiv.org/abs/2406.07887

[15] https://arxiv.org/abs/2405.04434

[16] https://arxiv.org/abs/2410.10254

[17] https://arxiv.org/abs/2502.05171

[18] https://huggingface.co/NX-AI/xLSTM-7b