Comment by Analemma_

8 days ago

Literally everybody doing cutting-edge AI research is trying to replace the transformer, because transformers have a bunch of undesirable properties, most notably compute and memory costs that grow quadratically with context window size. But they're also surprisingly resilient: despite the billions of dollars and man-hours poured into the field and many attempted improvements, cutting-edge models aren't all that different architecturally from the original attention paper ("Attention Is All You Need", 2017), aside from their size and a few incidental details like swapping out the ReLU activation function, because nobody has found anything better yet.
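
To make the quadratic part concrete, here's a minimal single-head scaled dot-product attention sketch in NumPy (names and shapes are mine, purely illustrative): the N x N score matrix is exactly where the O(N^2) compute and memory come from.

    import numpy as np

    def self_attention(X, Wq, Wk, Wv):
        # X: (N, d) token embeddings; Wq/Wk/Wv: (d, d) projections.
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        # The (N, N) score matrix below is the quadratic bottleneck:
        # both compute and memory scale as O(N^2) in sequence length N.
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
        return weights @ V  # (N, d)

    rng = np.random.default_rng(0)
    N, d = 1024, 64
    X = rng.standard_normal((N, d))
    Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
    out = self_attention(X, Wq, Wk, Wv)  # scores alone held N*N = ~1M floats

At a 128k context that score matrix is on the order of 10^10 entries per head per layer, which is why so much effort goes into sub-quadratic alternatives.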

I do expect transformers to be replaced eventually, but they seem to have a "bitter lesson" of their own: trying to outperform them usually ends in failure.

My guess is there's a cost-capability tradeoff such that the O(N^2) really is buying you something you couldn't get for O(N). Beyond that, some of the problems an intelligent system has to solve boil down to SAT and should be NP-complete... LLMs may be able to short-circuit those problems and get lucky guesses quite frequently, but if the underlying problems really are that hard, maybe the 'hallucinations' won't go away for anything that only spends O(N^2) either.
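
For contrast, here's a sketch of the kind of O(N) alternative the tradeoff argument is about: linear attention in the style of Katharopoulos et al. (2020), using their elu(x)+1 feature map. Reassociating the matmuls means the N x N matrix is never formed, but you give up the row-wise softmax, and whether that's precisely the capability the O(N^2) buys is the open question.

    import numpy as np

    def linear_attention(Q, K, V):
        # Q, K: (N, d) queries/keys; V: (N, d) values.
        # Replace softmax(Q K^T) V with phi(Q) (phi(K)^T V): computing
        # phi(K)^T V first gives a (d, d) matrix independent of N,
        # so time and memory are O(N) instead of O(N^2).
        phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1
        Qf, Kf = phi(Q), phi(K)
        KV = Kf.T @ V                  # (d, d), no (N, N) matrix anywhere
        Z = Qf @ Kf.sum(axis=0)        # (N,) per-row normalizer
        return (Qf @ KV) / Z[:, None]  # (N, d)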