Comment by pama
1 year ago
To the downvoters: I am curious if the downvoting is because of my speculation, or because of the difference in understanding of decoder transformer models. Thanks!
Because what you cite is about:
> in-context learning
LLMs have no concept of the semantic meaning of what they do; they are just dealing with next-token prediction.
"In-context learning" is the problem, not the solution, for general programming tasks.
Memoryless, ergodic, sub-Turing-complete problems are a very tiny class.
Thinking about how the Entscheidungsproblem relates to halting, or to the frame problem and the specification problem, may be a path.
But that paper isn't solving the problem at hand.
My main concern with the simplification of memorization or nearest-neighbor interpolation that is commonly assumed for LLMs is that these methods are ineffective at scale and unlikely to be used by decoder transformers in practice. That paper shows that the decoder transformer somehow came up with a better decision-tree fitting algorithm for low-data cases than any of the conventional or boosted tree solutions humans typically use from XGBoost or similar libraries. It also matched the best known algorithms for sparse linear systems. All this while training on sequences of random x1, y1, x2, y2, ... with y for each sequence generated by a new random function of a high-dimensional input x every time. The authors show that KNN does not cut it, and even suboptimal algorithms do not suffice. Not sure what else you need as evidence that decoder transformers can use programs to compress information.
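For anyone unfamiliar with that line of work, here is a minimal sketch of the kind of training data it uses (my own illustration with made-up dimensions and distributions, not the paper's actual code): each sequence gets a brand-new random function, so cross-sequence memorization cannot help.

```python
import numpy as np

def make_sequence(d=20, k=3, n_points=40, seed=None):
    """One training sequence: sample a fresh random sparse linear function
    f(x) = w.x (only k nonzero weights), then emit x1, y1, x2, y2, ... so the
    decoder must predict each y_i from the preceding in-context pairs."""
    rng = np.random.default_rng(seed)
    w = np.zeros(d)
    support = rng.choice(d, size=k, replace=False)
    w[support] = rng.normal(size=k)        # a new random function for every sequence
    xs = rng.normal(size=(n_points, d))    # fresh high-dimensional inputs
    ys = xs @ w
    return xs, ys

# Because f changes with every sequence, memorization or nearest-neighbor lookup
# across the training set cannot work; the model has to infer f in context.
xs, ys = make_sequence(seed=0)
```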
Littlestone and Warmuth made the connection to compression in 1986, which was later shown to be equivalent to VC dimension or PAC learnability.
Look into DBSCAN and OPTICS for a far closer lens on how clustering works in modern commercial ML; KNN is not the only form of clustering (toy sketch below).
But it is still in-context: additional compression that depends on a decider function, or equivalently a composition of linearized set-shattering parts.
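To illustrate the clustering point, here is a toy example of my own (not tied to any result discussed above): DBSCAN finds density-based clusters that a plain nearest-neighbor view of the data does not hand you.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two concentric rings: no nearest-neighbor labeling involved, yet the
# density-based method recovers the two groups unsupervised.
theta = np.linspace(0, 2 * np.pi, 100, endpoint=False)
inner = np.c_[np.cos(theta), np.sin(theta)]        # ring of radius 1
outer = 3 * np.c_[np.cos(theta), np.sin(theta)]    # ring of radius 3
X = np.vstack([inner, outer])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(sorted(set(labels)))   # [0, 1]: the two rings, found without labels
```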
Here is a fairly good lecture series on graduate-level complexity theory that will help with understanding parts of this: at least why multiple iterations help, but also why they aren't the answer to superhuman results.
https://youtube.com/playlist?list=PLm3J0oaFux3b8Gg1DdaJOzYNs...
Thanks for the tip, though I'm not sure how complexity theory will explain the impossibility of superhuman results. The main advantage ML methods have over humans is that they train much faster. Just like humans, they get better with more training. When they are good enough, they can be used to generate synthetic data, especially for cases like software optimization, where it is possible to verify the ground truth. A system only needs to be correct once in a thousand times to be useful for generating training data, as long as we can reliably eliminate all failures. Modern LLMs already clear that minimal bar for coding, and o1/o3 can probably handle complicated cases. There are differences between coding and games (where ML is already superhuman in most instances), but they start to blur once the model has a baseline command of language, a reasonable model of the world, and the ability to follow desired specs.
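A minimal sketch of that verify-and-filter idea (the names `model.generate` and `verify` are placeholders of mine, not any particular API): even a low raw success rate yields clean training data if the verifier is reliable.

```python
def generate_verified_examples(model, spec, verify, n_target, max_attempts=1_000_000):
    """Rejection-sample synthetic coding data: keep only candidates that pass a
    ground-truth check (tests, compiler, proof checker). `model` and `verify`
    stand in for whatever generator and verifier are actually available."""
    kept = []
    attempts = 0
    while len(kept) < n_target and attempts < max_attempts:
        attempts += 1
        candidate = model.generate(spec)   # may be wrong 999 times out of 1000
        if verify(spec, candidate):        # a reliable check filters out all failures
            kept.append((spec, candidate))
    return kept
```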
> they just are dealing with next token prediction.
And nuclear power plants are just heating water.
Probably the latter: LLMs are trained to predict the training set, not compress it. They will generalize to some degree, but that happens naturally as part of the training dynamics (it's not explicitly rewarded), and only to the extent it doesn't increase prediction errors.
I agree. However, my point is that they have to compress information in nontrivial ways to achieve their goal. The typical training set of modern LLMs is about 20 trillion tokens of 3 bytes each. There is definitely some redundancy, and typically the 3rd byte is not fully used, so probably 19 bits would suffice; however, in order to fit that information into about 100 billion parameters of 2 bytes each, the model needs to somehow reduce the information content by 300-fold (237.5 if you go from 19 bits per token down to 16-bit parameters, though arguably 8-bit quantization is close enough and gives another 2x compression, so probably 475). A quick check for the llama3.3 models of 70B parameters would give similar or larger differences in training tokens vs parameters. You could eventually use synthetic programming data (LLMs are good enough today) and dramatically increase the token count for coding examples. Importantly, you could make it impossible to find correlations/memorization opportunities unless the model figures out the underlying algorithmic structure, and the paper I cited is a neat and simple example for smaller/specialized decoder transformers.
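The arithmetic above, spelled out (the token count, bytes per token, and parameter count are the same round numbers used in this comment, not measurements of any particular model):

```python
tokens = 20e12   # ~20 trillion training tokens
params = 100e9   # ~100 billion parameters

raw       = (tokens * 3 * 8) / (params * 2 * 8)   # 3 bytes/token vs 2-byte params -> 300x
tighter   = (tokens * 19)    / (params * 16)      # 19 bits/token vs 16-bit params -> 237.5x
quantized = (tokens * 19)    / (params * 8)       # 8-bit quantized parameters     -> 475x

print(raw, tighter, quantized)   # 300.0 237.5 475.0
```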
It's hard to know where to start ...
A transformer is not a compressor. It's a transformer/generator. It'll generate a different output for an infinite number of different inputs. Does that mean it's got an infinite storage capacity?
The trained parameters of a transformer are not a compressed version of the training set, or of the information content of the training set; they are a configuration of the transformer so that its auto-regressive generative capabilities are optimized to produce the best continuation of partial training set samples that it is capable of.
Now, are there other architectures, other than a transformer, that might do a better or more efficient job (in terms of number of parameters) at predicting training-set samples, or even at compressing the information content of the training set? Perhaps, but we're not talking hypotheticals; we're talking about transformers (or at least most of us are).
Even if a transformer were a compression engine, which it isn't, rather than a generative architecture, why would you think that the number of tokens in the training set is a meaningful measure/estimate of its information content?! Heck, you go beyond that to considering a specific tokenization scheme and number of bits/bytes per token, all of which is utterly meaningless! You may as well just count the number of characters, or words, or sentences for that matter, in the training set, which would all be equally bad ways to estimate its information content, other than sentences perhaps having at least some tangential relationship to it.
sigh
You've been downvoted because you're talking about straw men, and other people are talking about transformers.