Comment by pama
1 year ago
My main concern with the simplification to memorization or nearest-neighbor interpolation that is commonly assumed for LLMs is that these methods are ineffective at scale and unlikely to be what decoder transformers use in practice. That paper shows that the decoder transformer somehow came up with a better decision-tree-fitting algorithm for low-data cases than any of the conventional or boosted-tree solutions humans typically use from XGBoost or similar libraries. It also matched the best known algorithms for sparse linear systems. All this while training on sequences of random x1, y1, x2, y2, ..., with the y for each sequence generated by a new random function of a high-dimensional input x every time. The authors show that KNN does not cut it, and even suboptimal algorithms do not suffice. Not sure what else you need as evidence that decoder transformers can use programs to compress information.
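For concreteness, here is a minimal sketch of that data-generation setup as I understand it (my own illustration, not the paper's code; the dimension, sparsity, and sampling distributions are placeholders, and sparse linear regression stands in for the other function classes such as decision trees):

```python
import numpy as np

def sample_icl_sequence(d=20, n_points=40, sparsity=3, rng=None):
    """One training sequence: inputs x1..xn with targets y1..yn, where y is
    produced by a *fresh* random function of the high-dimensional x each time.
    Sparse linear regression is used here purely as an illustration."""
    rng = rng if rng is not None else np.random.default_rng()
    w = np.zeros(d)
    support = rng.choice(d, size=sparsity, replace=False)
    w[support] = rng.normal(size=sparsity)    # a new random sparse function per sequence
    xs = rng.normal(size=(n_points, d))       # fresh random inputs
    ys = xs @ w                               # targets from this sequence's function
    return xs, ys                             # the decoder sees these interleaved: x1, y1, x2, y2, ...
```

The key point is that every sequence comes from a brand-new function, so there is nothing for the model to memorize across sequences; it has to infer the function in-context.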
Littlestone and Warmuth made the connection to compression in 1986; bounded compression size was later shown to be equivalent to finite VC dimension, i.e., PAC learnability.
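Roughly, the forward direction of that result, as I recall it (realizable case, compression-set size k, sample size m, failure probability delta):

```latex
% A size-k sample compression scheme gives, with probability at least 1-\delta
% over m i.i.d. examples, a generalization bound of the form
\operatorname{err}(h) \;\le\; \frac{k \ln m + \ln(1/\delta)}{m - k}.
```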
Look into DBSCAN and OPTICS for a far closer lens on how clustering works in modern commercial ML; KNN is not the only form of clustering.
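For anyone who wants to try them, both are available in scikit-learn (the data and parameters below are arbitrary placeholders; density-based methods label noise points as -1 rather than forcing every point into a cluster):

```python
import numpy as np
from sklearn.cluster import DBSCAN, OPTICS

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))                 # placeholder feature matrix

# Density-based clustering: no number of clusters chosen up front.
db_labels = DBSCAN(eps=0.8, min_samples=10).fit_predict(X)
opt_labels = OPTICS(min_samples=10).fit_predict(X)

print(len(set(db_labels)) - (1 if -1 in db_labels else 0), "DBSCAN clusters")
print(len(set(opt_labels)) - (1 if -1 in opt_labels else 0), "OPTICS clusters")
```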
But it is still in-context: additional compression that depends on a decider function, or equivalently a composition of linearized set-shattering parts.
I am very familiar with these and other clustering methods in modern ML, and have been involved in inventing and publishing some such methods myself in various scientific contexts. The paper I cited above used only 3 nearest neighbors as one baseline IIRC; that is why I mentioned KNN. However, even boosted trees failed to reduce the loss as much as the algorithm the decoder transformer learned from the data.
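To make the baseline comparison concrete, this is roughly how such per-sequence baselines are scored (my sketch, not the paper's protocol; scikit-learn's GradientBoostingRegressor stands in for XGBoost, and the synthetic sparse-linear data is a placeholder for the paper's function classes):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import GradientBoostingRegressor

def baseline_losses(xs, ys, n_context=30):
    """Fit classical baselines on the first n_context (x, y) pairs of one
    sequence and score squared error on the remaining pairs, mimicking how
    the in-context model is scored on the same prompt."""
    x_tr, y_tr = xs[:n_context], ys[:n_context]
    x_te, y_te = xs[n_context:], ys[n_context:]
    knn = KNeighborsRegressor(n_neighbors=3).fit(x_tr, y_tr)   # the 3-NN baseline
    gbt = GradientBoostingRegressor().fit(x_tr, y_tr)          # boosted-tree stand-in
    return (np.mean((knn.predict(x_te) - y_te) ** 2),
            np.mean((gbt.predict(x_te) - y_te) ** 2))

# Example on one synthetic sequence (placeholder sparse-linear function):
rng = np.random.default_rng(0)
w = np.zeros(20); w[rng.choice(20, 3, replace=False)] = rng.normal(size=3)
xs = rng.normal(size=(40, 20)); ys = xs @ w
print(baseline_losses(xs, ys))
```

Because each sequence uses so few context points, neither baseline has much to work with, which is exactly the low-data regime where the learned in-context algorithm reportedly pulls ahead.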