Comment by zamalek

4 days ago

It's because of how transformers work, especially the fact that the output layer is a bunch of weights which we quite literally do a weighted random choice from. My hunch is that diffusion models would have a higher chance of doing real reasoning - or something like a latent space for reasoning.

Thinking that LLMs are intelligent arises from an incomplete understanding of how they work or, alternatively, having shareholders to keep happy.