Comment by pama

1 year ago

> I wish people would understand what a large language model is.

I think your view of LLMs does not explain the learning of algorithms that these models are clearly capable of; see, for example: https://arxiv.org/abs/2208.01066

More generally, the best way to compress the information in a huge number of different coding examples is to figure out how to code, rather than to interpolate between existing blog posts and Q&A forums.

My own speculation is that with additional effort during training (RL or active learning in the training loop) we will probably reach superhuman coding performance within two years. I think that o3 is still imperfect but not very far from that point.

To the downvoters: I am curious if the downvoting is because of my speculation, or because of the difference in understanding of decoder transformer models. Thanks!

  • Because what you cite is about:

    > in-context learning

    LLMs have no concept of the semantic meaning of what they do; they are just dealing with next-token prediction.

    "in-context learning" is the problem, not the solution to general programming tasks.

    Memoryless, ergodic, sub-Turing-complete problems are a very tiny class.

    Think about how the Entscheidungsproblem relates to halting; the frame problem and the specification problem may be a path.

    But that paper isn't solving the problem at hand.
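
    To make the halting connection concrete: the classic diagonalization argument, written as code. This is an illustrative sketch; the `halts` oracle is hypothetical, and the point is that no correct total implementation of it can exist.

    ```python
    def halts(source: str, stdin: str) -> bool:
        # Hypothetical decider for the halting problem. No correct total
        # implementation can exist; this stub is only a placeholder.
        raise NotImplementedError

    def paradox(source: str) -> None:
        # Diagonalization: imagine running paradox on its own source code.
        # If halts() answered True, paradox would loop forever; if it answered
        # False, paradox would return. Either answer is wrong, so no such
        # halts() can exist.
        if halts(source, source):
            while True:
                pass
        return
    ```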

    • My main concern with the simplification to memorization or nearest-neighbor interpolation that is commonly assumed for LLMs is that these methods are ineffective at scale and unlikely to be used by decoder transformers in practice. That paper shows that a decoder transformer somehow came up with a better decision-tree-fitting algorithm for low-data cases than any of the conventional or boosted-tree solutions humans typically use from XGBoost or similar libraries. It also matched the best known algorithms for sparse linear systems. All this while training on sequences of random x1, y1, x2, y2, ..., with the y values of each sequence generated by a new random function of a high-dimensional input x every time. The authors show that KNN does not cut it, and even suboptimal algorithms do not suffice. Not sure what else you need as evidence that decoder transformers can use programs to compress information.
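
      A minimal sketch of that training setup as I read the paper (my own illustrative reconstruction, not the authors' code): every sequence is generated by a freshly sampled random function, so memorizing past sequences is useless and the model is pushed toward learning the fitting algorithm itself.

      ```python
      import numpy as np

      def sample_sparse_linear_task(d=20, k=3, n_points=40, rng=None):
          # One "task": a fresh random k-sparse linear function plus sampled points.
          # Illustrative sketch in the spirit of arXiv:2208.01066, not the authors' code.
          rng = rng or np.random.default_rng()
          w = np.zeros(d)
          support = rng.choice(d, size=k, replace=False)  # k nonzero coordinates
          w[support] = rng.normal(size=k)
          xs = rng.normal(size=(n_points, d))             # x_i ~ N(0, I_d)
          ys = xs @ w                                      # y_i = <w, x_i>
          return xs, ys

      def build_prompt(xs, ys):
          # Interleave x1, y1, x2, y2, ...; each scalar y is padded to the width
          # of x only so every token has a uniform size (a simplification; the
          # paper's embedding details differ). The transformer is trained to
          # predict y_i from the prefix (x1, y1, ..., x_i), with a brand-new
          # function for every sequence.
          d = xs.shape[1]
          tokens = []
          for x, y in zip(xs, ys):
              tokens.append(x)
              y_tok = np.zeros(d)
              y_tok[0] = y
              tokens.append(y_tok)
          return np.stack(tokens)

      xs, ys = sample_sparse_linear_task()
      prompt = build_prompt(xs, ys)  # shape (2 * n_points, d)
      ```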

      2 replies →

    • > they are just dealing with next-token prediction.

      And nuclear power plants are just heating water.

  • Probably the latter: LLMs are trained to predict the training set, not to compress it. They will generalize to some degree, but that happens naturally as part of the training dynamics (it's not explicitly rewarded), and only to the extent that it doesn't increase prediction errors.
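
    For concreteness, the pretraining objective is just next-token cross-entropy; nothing in that loss explicitly rewards compression or generalization. A minimal sketch, where `model` stands for any decoder-only transformer mapping token ids to next-token logits (an assumed interface, not any particular library's API):

    ```python
    import torch
    import torch.nn.functional as F

    def next_token_loss(model, token_ids):
        # token_ids: [batch, seq] integer tensor of training text.
        logits = model(token_ids[:, :-1])         # predict token t+1 from tokens <= t
        targets = token_ids[:, 1:]                # the loss only rewards predicting these
        return F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),  # [batch * (seq-1), vocab]
            targets.reshape(-1),
        )
    ```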

    • I agree. However, my point is that they have to compress information in nontrivial ways to achieve that goal. The typical training set of a modern LLM is about 20 trillion tokens of 3 bytes each. There is definitely some redundancy, and typically the 3rd byte is not fully used, so 19 bits per token would probably suffice; still, to fit that information into about 100 billion parameters of 2 bytes each, the model has to reduce the information content roughly 300-fold (237.5-fold comparing 19-bit tokens with 16-bit parameters, and arguably 8-bit quantization is close enough and gives another 2x, so probably 475-fold). A quick check for the Llama 3.3 70B model gives a similar or larger ratio of training tokens to parameters. You could eventually use synthetic programming data (LLMs are good enough for that today) and dramatically increase the token count for coding examples. Importantly, you could make it impossible to find correlations or memorization opportunities unless the model figures out the underlying algorithmic structure; the paper I cited is a neat, simple example of this for smaller, specialized decoder transformers.
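
      A quick back-of-the-envelope check of those ratios, just restating the arithmetic above in code (the token and parameter counts are the rough figures quoted, not exact numbers for any particular model):

      ```python
      # Rough compression ratios: training-set information vs. model capacity.
      tokens = 20e12            # ~20 trillion training tokens
      token_bits = 3 * 8        # ~3 bytes per token, naive encoding
      token_bits_tight = 19     # tighter estimate if the 3rd byte is not fully used

      params = 100e9            # ~100 billion parameters
      param_bits_fp16 = 16      # 2-byte parameters
      param_bits_int8 = 8       # 8-bit quantization

      print(tokens * token_bits / (params * param_bits_fp16))        # 300.0
      print(tokens * token_bits_tight / (params * param_bits_fp16))  # 237.5
      print(tokens * token_bits_tight / (params * param_bits_int8))  # 475.0
      ```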

      4 replies →