Comment by nextos
8 days ago
The xLSTM could become a good alternative to transformers: https://arxiv.org/abs/2405.04517. On very long contexts, such as those arising in DNA models, it performs really well.
There's a big state-space model comeback initiated by the S4 and Mamba papers. RWKV, which is a hybrid between classical RNNs and transformers, is also worth mentioning.
I was just about to post this. There was an MLST podcast about it a few days ago:
https://www.youtube.com/watch?v=8u2pW2zZLCs
Lots of related papers referenced in the description.
One claim from that podcast was that the xLSTM attention mechanism is (in practical implementation) more efficient than (transformer) FlashAttention, and therefore promises to significantly reduce the time/cost of test-time compute.
Test it out here:
https://github.com/NX-AI/mlstm_kernels
https://huggingface.co/NX-AI/xLSTM-7b
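If you just want to poke at the 7B checkpoint, here's a minimal sketch assuming the Hugging Face route described on the model card (AutoModelForCausalLM with a recent transformers, plus the xlstm / mlstm_kernels packages installed); exact package requirements may differ, so check the model card before running.

    # Minimal sketch: text generation with NX-AI/xLSTM-7b via Hugging Face transformers.
    # Assumes a recent `transformers` and the `xlstm` / `mlstm_kernels` packages are
    # installed, and that `accelerate` is available for device_map="auto".
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "NX-AI/xLSTM-7b"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # keeps the 7B weights at ~14 GB
        device_map="auto",
    )

    prompt = "The xLSTM architecture differs from a transformer in that"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(out[0], skip_special_tokens=True))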