Comment by blt
2 years ago
Sure, LLMs are like 1024-gram Markov chains (or whatever the context limit is). But there are problems: 1) the transition matrix is far too large to represent explicitly, and 2) it treats any two distinct 1024-grams as completely unrelated states, even if the first 1023 words are the same.
Function approximation solves both issues, and the Transformer is the best function class we've found so far.
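For scale, here's a rough back-of-the-envelope sketch of point 1; the vocabulary size and context length are illustrative assumptions (GPT-2-style numbers), not figures from the comment above:

```python
import math

# Illustrative assumptions: a ~50k-token vocabulary and a 1024-token context.
vocab_size = 50_257
context_len = 1024

# An explicit 1024-gram Markov chain needs one state per distinct context
# and one transition probability per (context, next-token) pair.
log10_states = context_len * math.log10(vocab_size)
log10_entries = log10_states + math.log10(vocab_size)

print(f"distinct contexts:        ~10^{log10_states:.0f}")   # ~10^4814
print(f"transition-matrix entries: ~10^{log10_entries:.0f}")

# Point 2 in the comment: two contexts differing only in their last token
# would still be two entirely separate rows in this table, sharing nothing.
```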