Comment by thegeomaster

1 day ago

Exactly. LLMs are gullible. They will believe anything you tell them, including incorrect things they have told themselves. This amplifies errors greatly, because they don't have the capacity to step back and try a different approach, or to introspect on why they failed. They need actual guidance from somebody with common sense; if let loose in the world, they mostly just spin around in circles because they don't have this executive intelligence.

A regular single-pass LLM indeed cannot step back, but newer ones like o1/o3/Marco-o1/QwQ can, and a larger agentic system composed of multiple LLMs definitely can. There is no "fundamental" limitation here. And once we start training these larger systems from the ground up via full reinforcement learning (rather than composing existing models), the sky's the limit. I'd be very bullish on DeepMind once they fully enter this race.

  • > And once we start training these larger systems from the ground up via full reinforcement learning (rather than composing existing models),

    Agree with this totally.

    I wouldn't call what the CoT models are doing exactly "stepping back" - their backtracking still dumps tokens into the output, so the model remains burdened with seeing all of its failed attempts as it searches for the right one. But my intuition on this could be wrong, and it's a much more advanced reasoning process than what "last-gen" (non-CoT) models do, so I can see your point.

    For an agentic system composed of multiple LLMs, I would strongly disagree if the LLMs are last-gen. In my experience, it is very hard to prompt a non-CoT LLM into rejecting an upstream assumption without making it paranoid, at which point it starts rejecting valid assumptions as well (rough sketch of what I mean at the end of this comment). That makes it hard to build a robust agentic system that can self-correct.

    I think that's different if the agents are o1-level, but it's hard to appreciate just how costly and slow this would be. Agents consume tokens like candy with all the back-and-forth, so a surprising number of tasks become economically infeasible (some back-of-envelope numbers below).

    (It seems everyone is waiting for an inference perf breakthrough that may or may not come.)
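
    To make the "rejecting upstream assumptions" point concrete, here's a rough sketch of the two-LLM pattern I have in mind: a planner proposes a solution and a separate critic is asked to challenge its assumptions one by one. Everything here is hypothetical - call_llm, the prompts, the structure - it's only meant to illustrate the shape of the problem, not any particular API.

      # Hypothetical two-LLM pipeline: planner proposes, critic reviews.
      # call_llm is a stand-in for whatever model API you actually use.

      def call_llm(prompt: str) -> str:
          """Placeholder: send `prompt` to your model and return its reply."""
          raise NotImplementedError

      PLANNER_PROMPT = (
          "Solve the task below. List every assumption you make under a "
          "heading 'ASSUMPTIONS:'.\n\nTask: {task}"
      )

      # The failure mode described above: ask the critic to "reject bad
      # assumptions" and a non-CoT model tends to either rubber-stamp
      # everything or reject everything. Forcing a per-assumption verdict
      # with a stated justification is one way to soften that.
      CRITIC_PROMPT = (
          "You are reviewing another model's plan. For EACH listed assumption, "
          "answer 'SOUND' or 'UNSUPPORTED' with one sentence of justification. "
          "Do not mark an assumption UNSUPPORTED unless you can say concretely "
          "why it fails.\n\nPlan:\n{plan}"
      )

      def plan_and_critique(task: str) -> tuple[str, str]:
          plan = call_llm(PLANNER_PROMPT.format(task=task))
          review = call_llm(CRITIC_PROMPT.format(plan=plan))
          return plan, review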
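
    And on the cost point, some made-up but plausible numbers (every figure here is an assumption, not a benchmark) show how quickly the back-and-forth adds up:

      calls_per_task = 40        # planner/critic/tool round-trips in one agent run
      tokens_per_call = 6_000    # prompt + reasoning + response per round-trip
      price_per_mtok = 15.00     # dollars per million tokens for an o1-class model

      cost = calls_per_task * tokens_per_call * price_per_mtok / 1_000_000
      print(f"~${cost:.2f} per task")  # ~$3.60 with these numbers

    A few dollars per task is fine for some workloads and a non-starter for many others, which is what I mean by "economically infeasible".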