Comment by HarHarVeryFunny

1 month ago

That depends on how you define AGI - it's a meaningless term to use since everyone uses it to mean different things. What exactly do you mean?!

Yes, there is a lot that can be improved via different training, but at what point is it no longer a language model (i.e. something that auto-regressively predicts language continuations)?

I like to use an analogy to the children's "Stone Soup" story whereby a "stone soup" (starting off as a stone in a pot of boiling water) gets transformed into a tasty soup/stew by strangers incrementally adding extra ingredients to "improve the flavor" - first a carrot, then a bit of beef, etc. At what point do you accept that the resulting tasty soup is not in fact stone soup?! It's like taking an auto-regressively SGD-trained Transformer, and incrementally tweaking the architecture, training algorithm, training objective, etc, etc. At some point it becomes a bit perverse to choose to still call it a language model.

Some of the "it's just training" changes that would be needed to make today's LLMs more brain-like may be things like changing the training objective completely from auto-regressive to predicting external events (with the goal of having it be able to learn the outcomes of its own actions, in order to be able to plan them), which to be useful would require the "LLM" to then be autonomous and act in some (real/virtual) world in order to learn.

Another "it's just training" change would be to replace pre/mid/post-training with continual/incremental runtime learning to again make the model more brain-like and able to learn from its own autonomous exploration of behavior/action and environment. This is a far more profound, and ambitious, change than just fudging incremental knowledge acquisition for some semblance of "on the job" learning (which is what the AI companies are currently working on).

If you put these two "it's just training/learning" enhancements together then you've now got something much more animal/human-like, and much more capable than an LLM, but it's already far from a language model - something that passively predicts the next word every time you push the "generate next word" button. This would now be an autonomous agent, learning how to act and control/exploit the world around it. The whole pre-trained, same-for-everyone, model running in the cloud, would then be radically different - every model instance is then more like an individual learning based on its own experience, and maybe you're now paying for continual-learning compute rather than just "LLM tokens generated".

These are "just" training (and deployment!) changes, but to more closely approach human capability (but again, what do you mean by "AGI"?) there would also need to be architectural changes and additions to the "Transformer" architecture (add looping, internal memory, etc), depending on exactly how close you want to get to human/animal capability.

> which to be useful would require the "LLM" to then be autonomous and act in some (real/virtual) world in order to learn.

You described modern RLVR for tasks like coding. Plug an LLM into a virtual env with a task. Drill it based on task completion. Force it to get better at problem-solving.

It's still an autoregressive next token prediction engine. 100% LLM, zero architectural changes. We just moved it past pure imitation learning and towards something else.
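
As a rough illustration of the "drill it based on task completion" loop described above, here is a minimal sketch. It is not a real LLM or any lab's RLVR pipeline: the "policy" is just a softmax over three hypothetical candidate answers, the task and the `verify` checker are made up for the example, and the update is plain REINFORCE with no baseline.

```python
import math
import random

# Hypothetical toy task: "2 + 3 = ?" with three candidate answers.
CANDIDATES = ["4", "5", "6"]


def verify(answer: str) -> float:
    """Verifiable reward: 1.0 if the answer passes the check, else 0.0."""
    return 1.0 if int(answer) == 2 + 3 else 0.0


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps=500, lr=0.5, seed=0):
    """Sample an answer, score it with the verifier, nudge the policy."""
    rng = random.Random(seed)
    logits = [0.0] * len(CANDIDATES)  # stand-in for model weights (assumption)
    for _ in range(steps):
        probs = softmax(logits)
        choice = rng.choices(range(len(CANDIDATES)), weights=probs)[0]
        reward = verify(CANDIDATES[choice])
        # REINFORCE: d log p(choice) / d logit_i = 1[i == choice] - p_i
        for i in range(len(logits)):
            indicator = 1.0 if i == choice else 0.0
            logits[i] += lr * reward * (indicator - probs[i])
    return softmax(logits)


final_probs = train()
print(dict(zip(CANDIDATES, final_probs)))
```

The sampling step is unchanged throughout; only the reward signal decides which samples get reinforced, which is the sense in which nothing architectural has to change.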

  • Yes, if all you did was replace current pre/mid/post training with a new (elusive holy grail) runtime continual learning algorithm, then it would definitely still just be a language model. You seem to be talking about it having TWO runtime continual learning algorithms, next-token and long-horizon RL, but of course RL is part of what we're calling an LLM.

    It's not obvious if you just did this without changing the learning objective from self-prediction (auto-regressive) to external prediction whether you'd actually gain much capability though. Auto-regressive training is what makes LLMs imitators - always trying to do the same as before.

    In fact, if you did just let a continual learner autonomously loose in some virtual environment, why would you expect it to behave any differently from a current LLM put in a loop, with tool use as a way to expose it to new data, other than continually learning from whatever it was exposed to in the environment? An imitative (auto-regressive) LLM doesn't have any drive to do anything new - if you just keep feeding its own output back in as an input, then it's basically just a dynamical system that will eventually settle down into some attractor states representing the closure of the patterns it has learnt and is generating.

    If you want the model to behave in a more human/animal-like self-motivated agentic fashion, then I think the focus has to be on learning how to act to control and take advantage of the semi-predictable environment, which is going to be based on having prediction of the environment as the learning objective (vs auto-regressive), plus some innate drives (curiosity, boredom, etc) to bias behavior to maximize learning and creative discovery.

    Continual learning also isn't going to magically solve the RL reward problem (how do you define and measure RL rewards in the general, non-math/programming, case?). In fact post-training is a very human-curated affair since humans have identified math and programming as tasks where this works and have created these problem-specific rewards. If you wanted the model to discover its own rewards at runtime, as part of your new runtime RL algorithm perhaps, then you'd have to figure out how to bake that into the architecture.
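
The "dynamical system that settles into attractor states" point made earlier in this comment can be shown with a toy sketch. The assumptions are heavy: `GREEDY_NEXT` is a made-up, deterministic next-token table standing in for greedy decoding, and the state is a single token rather than a full context. Real LLMs sample stochastically and condition on long histories, so this only illustrates the finite, deterministic case, where feeding output back as input must eventually enter a cycle.

```python
def find_attractor(step, state):
    """Iterate state -> step(state) until a state repeats.

    Returns (transient, cycle): the states visited before the loop starts,
    and the repeating loop itself.
    """
    seen = {}        # state -> index of its first visit
    trajectory = []  # states in visit order
    while state not in seen:
        seen[state] = len(trajectory)
        trajectory.append(state)
        state = step(state)
    start = seen[state]
    return trajectory[:start], trajectory[start:]


# Hypothetical greedy "next token" table (an assumption for illustration).
GREEDY_NEXT = {
    "the": "cat",
    "cat": "sat",
    "sat": "on",
    "on": "the",
    "a": "dog",
    "dog": "sat",
}

transient, cycle = find_attractor(GREEDY_NEXT.__getitem__, "a")
print("transient:", transient)
print("cycle:", cycle)
```

Whatever token the loop starts from, it ends up repeating patterns already present in the table, with nothing in the loop pushing it toward anything new.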

    • No. There are no architectural changes and no "second runtime learning algorithm". There's just the good old in-context learning that all LLMs get from pre-training. RLVR is a training stage that pressures the LLM to take advantage of it on real tasks.

      "Runtime continual learning algorithm" is an elusive target of questionable desirability - given that we already have in-context learning, and "get better at SFT and RLVR lmao" is relatively simple to pull off and gives kickass gains in the here and now.

      I see no reason why "behave in a more human/animal-like self-motivated agentic fashion" can't be obtained from more RLVR, if that's what you want to train your LLMs for.
