Comment by blt
2 years ago
Sure, LLMs are like 1024-gram Markov chains (or whatever the context limit is). But there are problems: 1) the transition matrix is far too large to represent explicitly, and 2) it treats any two distinct 1024-grams as completely unrelated states, even if the first 1023 words are the same.
Function approximation solves both issues, and the Transformer is the best function class we've found so far.
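For scale, here's a rough back-of-the-envelope sketch of point 1; the vocabulary size and context length are illustrative assumptions (GPT-2-style numbers), not figures from the comment above:

```python
import math

# Illustrative assumptions: a ~50k-token vocabulary and a 1024-token context.
vocab_size = 50_257
context_len = 1024

# An explicit 1024-gram Markov chain needs one state per distinct context
# and one transition probability per (context, next-token) pair.
log10_states = context_len * math.log10(vocab_size)
log10_entries = log10_states + math.log10(vocab_size)

print(f"distinct contexts:        ~10^{log10_states:.0f}")   # ~10^4814
print(f"transition-matrix entries: ~10^{log10_entries:.0f}")

# Point 2 in the comment: two contexts differing only in their last token
# would still be two entirely separate rows in this table, sharing nothing.
```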