Comment by pama

1 year ago

> I wish people would understand what a large language model is.

I think your view of LLMs does not explain the learning of algorithms that these models are clearly capable of; see, for example: https://arxiv.org/abs/2208.01066

More generally, the best way to compress the information in a huge number of different coding examples is to figure out how to code, rather than to interpolate between existing blog posts and Q&A forums.

My own speculation is that with additional effort during training (RL or active learning in the training loop) we will probably reach superhuman coding performance within two years. I think that o3 is still imperfect but not very far from that point.

To the downvoters: I am curious if the downvoting is because of my speculation, or because of the difference in understanding of decoder transformer models. Thanks!

  • Because what you cite is about:

    > in-context learning

    LLMs have no concept of the semantic meaning of what they do; they are just dealing with next-token prediction.

    "in-context learning" is the problem, not the solution to general programming tasks.

    Memoryless, ergodic, sub-Turing-complete problems are a very tiny class.

    Think about how the Entscheidungsproblem relates to halting; the frame problem and the specification problem may be a path.

    But that paper isn't solving the problem at hand.
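
    To make the halting connection concrete: the classic diagonalization argument, written as code. This is an illustrative sketch; the `halts` oracle is hypothetical, and the point is that no correct total implementation of it can exist.

    ```python
    def halts(source: str, stdin: str) -> bool:
        # Hypothetical decider for the halting problem. No correct total
        # implementation can exist; this stub is only a placeholder.
        raise NotImplementedError

    def paradox(source: str) -> None:
        # Diagonalization: imagine running paradox on its own source code.
        # If halts() answered True, paradox would loop forever; if it answered
        # False, paradox would return. Either answer is wrong, so no such
        # halts() can exist.
        if halts(source, source):
            while True:
                pass
        return
    ```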

    • My main concern with the simplification to memorization or nearest-neighbor interpolation that is commonly assumed for LLMs is that these methods are ineffective at scale and unlikely to be used by decoder transformers in practice. That paper shows that a decoder transformer somehow came up with a better decision-tree-fitting algorithm for low-data cases than any of the conventional or boosted-tree solutions humans typically use from XGBoost or similar libraries. It also matched the best known algorithms for sparse linear systems. All this while training on sequences of random x1, y1, x2, y2, ..., with the y values of each sequence generated by a new random function of a high-dimensional input x every time. The authors show that KNN does not cut it, and even suboptimal algorithms do not suffice. Not sure what else you need as evidence that decoder transformers can use programs to compress information.
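
      A minimal sketch of that training setup as I read the paper (my own illustrative reconstruction, not the authors' code): every sequence is generated by a freshly sampled random function, so memorizing past sequences is useless and the model is pushed toward learning the fitting algorithm itself.

      ```python
      import numpy as np

      def sample_sparse_linear_task(d=20, k=3, n_points=40, rng=None):
          # One "task": a fresh random k-sparse linear function plus sampled points.
          # Illustrative sketch in the spirit of arXiv:2208.01066, not the authors' code.
          rng = rng or np.random.default_rng()
          w = np.zeros(d)
          support = rng.choice(d, size=k, replace=False)  # k nonzero coordinates
          w[support] = rng.normal(size=k)
          xs = rng.normal(size=(n_points, d))             # x_i ~ N(0, I_d)
          ys = xs @ w                                      # y_i = <w, x_i>
          return xs, ys

      def build_prompt(xs, ys):
          # Interleave x1, y1, x2, y2, ...; each scalar y is padded to the width
          # of x only so every token has a uniform size (a simplification; the
          # paper's embedding details differ). The transformer is trained to
          # predict y_i from the prefix (x1, y1, ..., x_i), with a brand-new
          # function for every sequence.
          d = xs.shape[1]
          tokens = []
          for x, y in zip(xs, ys):
              tokens.append(x)
              y_tok = np.zeros(d)
              y_tok[0] = y
              tokens.append(y_tok)
          return np.stack(tokens)

      xs, ys = sample_sparse_linear_task()
      prompt = build_prompt(xs, ys)  # shape (2 * n_points, d)
      ```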

      2 replies →

    • > they are just dealing with next-token prediction.

      And nuclear power plants are just heating water.

  • Probably the latter: LLMs are trained to predict the training set, not to compress it. They will generalize to some degree, but that happens naturally as part of the training dynamics (it's not explicitly rewarded), and only to the extent that it doesn't increase prediction errors.
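
    For concreteness, the pretraining objective is just next-token cross-entropy; nothing in that loss explicitly rewards compression or generalization. A minimal sketch, where `model` stands for any decoder-only transformer mapping token ids to next-token logits (an assumed interface, not any particular library's API):

    ```python
    import torch
    import torch.nn.functional as F

    def next_token_loss(model, token_ids):
        # token_ids: [batch, seq] integer tensor of training text.
        logits = model(token_ids[:, :-1])         # predict token t+1 from tokens <= t
        targets = token_ids[:, 1:]                # the loss only rewards predicting these
        return F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),  # [batch * (seq-1), vocab]
            targets.reshape(-1),
        )
    ```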

    • I agree. However, my point is that they have to compress information in nontrivial ways to achieve that goal. The typical training set of a modern LLM is about 20 trillion tokens of 3 bytes each. There is definitely some redundancy, and typically the 3rd byte is not fully used, so 19 bits per token would probably suffice; still, to fit that information into about 100 billion parameters of 2 bytes each, the model has to reduce the information content roughly 300-fold (237.5-fold comparing 19-bit tokens with 16-bit parameters, and arguably 8-bit quantization is close enough and gives another 2x, so probably 475-fold). A quick check for the Llama 3.3 70B model gives a similar or larger ratio of training tokens to parameters. You could eventually use synthetic programming data (LLMs are good enough for that today) and dramatically increase the token count for coding examples. Importantly, you could make it impossible to find correlations or memorization opportunities unless the model figures out the underlying algorithmic structure; the paper I cited is a neat, simple example of this for smaller, specialized decoder transformers.
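
      A quick back-of-the-envelope check of those ratios, just restating the arithmetic above in code (the token and parameter counts are the rough figures quoted, not exact numbers for any particular model):

      ```python
      # Rough compression ratios: training-set information vs. model capacity.
      tokens = 20e12            # ~20 trillion training tokens
      token_bits = 3 * 8        # ~3 bytes per token, naive encoding
      token_bits_tight = 19     # tighter estimate if the 3rd byte is not fully used

      params = 100e9            # ~100 billion parameters
      param_bits_fp16 = 16      # 2-byte parameters
      param_bits_int8 = 8       # 8-bit quantization

      print(tokens * token_bits / (params * param_bits_fp16))        # 300.0
      print(tokens * token_bits_tight / (params * param_bits_fp16))  # 237.5
      print(tokens * token_bits_tight / (params * param_bits_int8))  # 475.0
      ```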

      4 replies →