Comment by nyrikki
1 year ago
Because what you cite is about:
> in-context learning
LLMs have no concept of the semantic meaning of what they do; they are just dealing with next token prediction.
"in-context learning" is the problem, not the solution to general programming tasks.
Memoryless, ergodic, sub-Turing-complete problems are a very tiny class.
Thinking about how the Entscheidungsproblem relates to halting, or to the frame problem and the specification problem, may be a path.
But that paper isn't solving the problem at hand.
My main concern with the simplification of memorization or nearest-neighbor interpolation that is commonly assumed for LLMs is that these methods are ineffective at scale and unlikely to be used by decoder transformers in practice. That paper shows that the decoder transformer somehow came up with a better decision tree fitting algorithm for low-data cases than any of the conventional or boosted tree solutions humans typically use from XGBoost or similar libraries. It also matched the best known algorithms for sparse linear systems. All this while training on sequences of random x1, y1, x2, y2, ... with y for each sequence generated by a new random function of a high-dimensional input x every time. The authors show that KNN does not cut it, and even suboptimal algorithms do not suffice. Not sure what else you need as evidence that decoder transformers can use programs to compress information.
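For readers who want to see the shape of that setup, here is a toy sketch in Python. It is not the paper's code: it assumes noiseless dense linear functions in low dimension, and `sample_prompt`, `knn_predict`, `solve_linear`, and `compare` are hypothetical names made up for illustration. It only shows why a 3-nearest-neighbor baseline loses to an algorithm that actually recovers the function from the in-context examples.

```python
import random

def sample_prompt(dim, n_examples, rng):
    """One sequence x1, y1, ..., xn, yn generated by a fresh random linear function."""
    w = [rng.gauss(0, 1) for _ in range(dim)]  # new function for every sequence
    xs = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_examples)]
    ys = [sum(wi * xi for wi, xi in zip(w, x)) for x in xs]
    return xs, ys

def knn_predict(xs, ys, query, k=3):
    """Baseline: average the labels of the k nearest in-context examples."""
    order = sorted(range(len(xs)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(xs[i], query)))
    nearest = order[:k]
    return sum(ys[i] for i in nearest) / len(nearest)

def solve_linear(xs, ys):
    """Recover w from the first dim examples via Gauss-Jordan elimination."""
    d = len(xs[0])
    a = [row[:] + [y] for row, y in zip(xs[:d], ys[:d])]  # augmented matrix
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]  # partial pivoting
        for r in range(d):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [v - f * p for v, p in zip(a[r], a[col])]
    return [a[i][d] / a[i][i] for i in range(d)]

def compare(dim=5, n_examples=20, trials=200, seed=0):
    """Mean squared error on a held-out query: (3-NN baseline, exact fit)."""
    rng = random.Random(seed)
    knn_err = fit_err = 0.0
    for _ in range(trials):
        xs, ys = sample_prompt(dim, n_examples + 1, rng)
        ctx_x, ctx_y, query, target = xs[:-1], ys[:-1], xs[-1], ys[-1]
        w_hat = solve_linear(ctx_x, ctx_y)
        fit = sum(wi * xi for wi, xi in zip(w_hat, query))
        knn_err += (knn_predict(ctx_x, ctx_y, query) - target) ** 2
        fit_err += (fit - target) ** 2
    return knn_err / trials, fit_err / trials
```

Calling `compare()` returns the two mean squared errors; in this noiseless toy the exact fit is essentially perfect while the nearest-neighbor average is not, which is the gap the comment is pointing at.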
Littlestone and Warmuth made the connection to compression in 1986, which was later shown to be equivalent to VC dimension or PAC learnability.
Look into DBSCAN and OPTICS for far closer lenses on how clustering works in modern commercial ML; KNN is not the only form of clustering.
But it is still in-context: additional compression that depends on a decider function, or equivalently on a composition of linearized set-shattering parts.
I am very familiar with these and other clustering methods in modern ML, and have been involved in inventing and publishing some such methods myself in various scientific contexts. The paper I cited above only used 3 nearest neighbors as one baseline IIRC; that is why I mentioned KNN. However, even boosted trees failed to reduce the loss as much as the algorithm learned from the data by the decoder transformer.
Here is a fairly good lecture series on graduate-level complexity theory that will help you understand parts of this, at least why multiple iterations help but also why they aren't the answer to superhuman results.
https://youtube.com/playlist?list=PLm3J0oaFux3b8Gg1DdaJOzYNs...
Thanks for the tip, though I’m not sure how complexity theory will explain the impossibility of superhuman results. The main advantage ML methods have over humans is that they train much faster. Just like humans, they get better with more training. When they are good enough, they can be used to generate synthetic data, especially for cases like software optimization, where it is possible to verify the ground truth. A system needs to be correct only once in a thousand times to be useful for generating training data, as long as we can reliably eliminate all failures. Modern LLMs can already be better than that minimal requirement for coding, and o1/o3 can probably handle complicated cases. There are differences between coding and games (where ML is already superhuman in most instances), but they start to blur once the model has a baseline command of language, a reasonable model of the world, and the ability to follow desired specs.
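A minimal sketch of that generate-and-verify loop, in Python. Everything here is a made-up stand-in: `noisy_generator` plays the role of a model that is right roughly once in 720 attempts (a random permutation of 6 distinct numbers), and `verifier` plays the role of a reliable ground-truth check. The point is only that a cheap, exact filter turns a mostly wrong generator into clean training pairs.

```python
import random

def noisy_generator(task, rng):
    """Hypothetical stand-in for a model: a random permutation, rarely the sorted one."""
    out = task[:]
    rng.shuffle(out)
    return out

def verifier(task, candidate):
    """Ground-truth check: candidate must be exactly the sorted version of task."""
    return sorted(task) == candidate

def build_dataset(n_tasks, attempts_per_task, seed=0):
    """Keep only verified (task, answer) pairs; also report how many attempts were spent."""
    rng = random.Random(seed)
    kept, tried = [], 0
    for _ in range(n_tasks):
        task = rng.sample(range(100), 6)  # 6 distinct values: 1/720 chance per attempt
        for _ in range(attempts_per_task):
            tried += 1
            candidate = noisy_generator(task, rng)
            if verifier(task, candidate):
                kept.append((task, candidate))
                break  # one verified answer per task is enough here
    return kept, tried
```

For example, `build_dataset(20, 5000)` returns the verified pairs and the number of attempts made. Every kept pair is correct by construction; the cost shows up only in the attempt count, and the whole scheme depends on the verifier being reliable.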
ML is better than biological neurons in some tasks; they are different contexts.
Almost all of the performance on, say, college tests comes purely from the pre-training: pattern finding and detection.
Transformers are limited to DLOGTIME-uniform TC0; they can't even do the Boolean circuit value problem.
The ability to use the properties of BPP does help.
Understanding the power and the limitations of iteration and of improving approximations requires descriptive complexity theory IMHO.
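One reading of the BPP remark is the standard error-reduction trick: repeat a randomized decider that is right more often than not and take a majority vote. The Python sketch below assumes that reading; `noisy_decider`, `majority_vote`, and `accuracy` are hypothetical names, and the 0.6 success rate is an arbitrary choice for illustration.

```python
import random

def noisy_decider(truth, p_correct, rng):
    """Returns the right yes/no answer with probability p_correct, else the wrong one."""
    return truth if rng.random() < p_correct else not truth

def majority_vote(truth, p_correct, runs, rng):
    """Run the decider independently `runs` times and answer with the majority."""
    yes_votes = sum(noisy_decider(truth, p_correct, rng) for _ in range(runs))
    return yes_votes * 2 > runs

def accuracy(p_correct, runs, trials=2000, seed=0):
    """Empirical rate at which the majority answer is correct (the true answer is 'yes')."""
    rng = random.Random(seed)
    hits = sum(majority_vote(True, p_correct, runs, rng) for _ in range(trials))
    return hits / trials
```

Comparing `accuracy(0.6, 1)` with `accuracy(0.6, 101)` shows iteration sharpening a weak but better-than-chance decider. The same vote does nothing for a decider that is no better than a coin flip, which is one way to see both the power and the limits of repetition.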
> they just are dealing with next token prediction.
And nuclear power plants are just heating water.