Comment by janalsncm

8 days ago

There are alternatives that optimize around the edges, like DeepSeek's Multi-head Latent Attention (MLA) or Grouped Query Attention (GQA). DeepSeek also showed an optimization of Mixture of Experts. These are all clear improvements over the original Vaswani et al. architecture.
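To make the GQA idea concrete, here's a minimal NumPy sketch (my own illustration, not any particular implementation): several query heads share a single key/value head, which shrinks the KV cache without changing the attention math.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """Grouped Query Attention sketch.

    q has shape (Hq, T, d); k and v have shape (Hkv, T, d) with
    Hq a multiple of Hkv. Each k/v head serves a group of Hq/Hkv
    query heads, so the KV cache is Hkv/Hq the size of full MHA.
    """
    Hq, T, d = q.shape
    Hkv = k.shape[0]
    group = Hq // Hkv
    # Broadcast each shared KV head across its group of query heads.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    # Standard scaled dot-product attention per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

With Hq == Hkv this reduces to ordinary multi-head attention; with Hkv == 1 it is multi-query attention, and GQA sits in between.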

There are also optimizations like extreme 1.58-bit quantization (ternary weights) that can be applied to almost any architecture.
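The "1.58 bit" figure comes from weights taking one of three values, {-1, 0, +1} (log2(3) ≈ 1.58). A rough sketch of the absmean-style ternarization used in that line of work, as an illustration rather than a faithful reproduction:

```python
import numpy as np

def ternary_quantize(w, eps=1e-8):
    """Ternarize a weight tensor to {-1, 0, +1} with one scale.

    Scale by the mean absolute weight, round, then clip, so each
    weight is approximated as wq * scale. Matmuls against ternary
    weights need no multiplications, only adds and sign flips.
    """
    scale = np.abs(w).mean() + eps
    wq = np.clip(np.round(w / scale), -1, 1)
    return wq, scale
```

The win is that a ternary matmul degenerates into additions/subtractions, so memory and compute drop sharply whatever the surrounding architecture is.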

There are architectures that stray further, like SSMs and some attempts at bringing the RNN back from the dead. And there are even text diffusion models that try to generate paragraphs the way we generate images, i.e. not word by word.
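The "not word by word" part can be sketched as a toy decoding loop in the spirit of masked-denoising text diffusion: start from an all-masked sequence and, on each step, commit the positions the model is most confident about in parallel. The `predict` hook here is a hypothetical stand-in for a real denoiser:

```python
MASK = "[MASK]"

def diffusion_decode(length, predict, steps=4):
    """Toy parallel-unmasking decoder (illustration only).

    Starts fully masked; each iteration reveals the most confident
    predictions anywhere in the sequence, rather than appending
    tokens strictly left to right as an autoregressive model would.
    `predict(tokens, i)` -> (token, confidence) is an assumed model hook.
    """
    tokens = [MASK] * length
    per_step = max(1, length // steps)
    while MASK in tokens:
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        preds = {i: predict(tokens, i) for i in masked}
        # Reveal the top-confidence positions in parallel.
        for i in sorted(masked, key=lambda i: -preds[i][1])[:per_step]:
            tokens[i] = preds[i][0]
    return tokens
```

Real text diffusion models refine all positions over a fixed noise schedule, but the structural point is the same: generation is an iterative whole-sequence refinement, not a left-to-right scan.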