Comment by estebarb

5 hours ago

The criticisms are not strawmen; they are actually well grounded in math. Take, for instance, his promotion of energy-based models.

In a probabilistic model, the model is always forced to output a probability distribution over the set of tokens, even if every candidate state is nonsense. In an energy-based model, the model can infer that a state makes no sense at all and can backtrack by itself.
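As a rough illustration of the difference (a minimal sketch, not anyone's actual model): softmax normalizes whatever scores it gets, so even uniformly bad candidates come out as a valid distribution, while unnormalized energies keep the "everything here is bad" signal. Treating the negative logit as an energy, and the `should_backtrack` helper with its threshold, are assumptions made up for this example.

```python
import math

def softmax(logits):
    """Normalize scores into probabilities; the result always sums to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def energies(logits):
    """Assumption for this sketch: energy is the negative, unnormalized score."""
    return [-z for z in logits]

def should_backtrack(logits, threshold=3.0):
    """Hypothetical rule: backtrack when even the best candidate has high energy."""
    return min(energies(logits)) > threshold

good = [5.0, 0.1, -2.0]    # one candidate clearly makes sense
bad = [-9.0, -9.5, -10.0]  # every candidate scores poorly

# Softmax hides how bad things are: roughly [0.51, 0.31, 0.19]
print(softmax(bad))

# The raw energies still carry that information.
print(should_backtrack(good))  # False
print(should_backtrack(bad))   # True
```

The point is only that normalization throws away the absolute scale of the scores; how a real energy-based model would detect and recover from a bad state is a separate design question.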

Notice that diffusion models, DINO, and other successful models are energy-based models, or end up being good proxies for the data density (density is a proxy for entropy ~ information).

Finally, all probability models can be thought of as energy-based, but not all EBMs output probability distributions.
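The asymmetry can be written out with the usual Gibbs/Boltzmann convention (the symbols $E$ and $Z$ are notation assumed here, not something from the original argument). Starting from any density $p$, defining the energy as its negative log gives a normalizer of exactly 1:

```latex
E(x) = -\log p(x)
\quad\Longrightarrow\quad
Z = \int e^{-E(x)}\,dx = \int p(x)\,dx = 1,
\qquad
p(x) = \frac{e^{-E(x)}}{Z}.
```

Going the other way, an arbitrary energy only defines a distribution if $Z$ is finite. For example, $E(x) = 0$ on the whole real line gives a divergent $Z$, and even when $Z$ is finite it may be intractable to compute.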

So, his argument is not against transformers or the architectures themselves, but more about the learned geometry.