Comment by og_kalu

2 days ago

>Humans can break things down and work through them step by step. The LLMs one-shot pattern match.

I've had LLMs break down problems and work through them, pivot when errors arise and all that jazz. They're not perfect at it and they're worse than humans but it happens.

>Anthropic even showed that the reasoning models tended to work backwards: one shotting an answer and then matching a chain of thought to it after the fact.

This is another failure mode that also occurs in humans. A number of experiments suggest that human explanations are often post hoc rationalizations, even when people genuinely believe otherwise.

>If a human is capable of multiplying double-digit numbers, they can also multiply those large ones.

Yeah, and some of them will make mistakes, and some of them will be less accurate than GPT-5. We didn't switch to calculators and spreadsheets just for the fun of it.

>GPT’s answer was orders of magnitude off. It resembles the right answer superficially but it’s a very different result.

GPT-5 on the site is a router that will give you who knows what model, so I tried your query with the API directly (GPT-5, medium thinking) and it gave me:

9.207337461477596e+27

When prompted to write out the full number, it returned:

9,207,337,461,477,596,127,977,612,004.

You can replicate this if you use the API. Honestly, I'm surprised; I didn't realize the state of the art had become this precise.
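
For what it's worth, here is roughly how the call looks; a sketch, assuming the OpenAI Python SDK's Responses API, "gpt-5" as the model id, and that "medium thinking" maps to the reasoning-effort knob. The multiplication prompt itself is the one from your comment, so it's left as a placeholder. The Decimal bit at the end just checks that the two forms of the answer agree with each other.

    # Sketch of the replication. Assumptions: OpenAI Python SDK (Responses API),
    # model id "gpt-5", and reasoning effort "medium" for "medium thinking".
    from decimal import Decimal
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    ORIGINAL_QUERY = "..."  # the multiplication prompt from the parent comment, not reproduced here

    resp = client.responses.create(
        model="gpt-5",                   # assumption: API model id
        reasoning={"effort": "medium"},  # assumption: "medium thinking" = medium effort
        input=ORIGINAL_QUERY,
    )
    print(resp.output_text)

    # Consistency check on the two answers above: the full integer, rounded to
    # 16 significant digits, matches the scientific-notation form given first.
    full = Decimal("9207337461477596127977612004")
    print(f"{full:.15e}")  # 9.207337461477596e+27
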

Now what? Does this prove you wrong?

This is kind of the problem. There's no sense in making gross generalizations, especially based on behavior that also manifests in humans.

LLMs don't understand some things well. Why not leave it at that?

Here is how GPT itself described LLM reasoning when I asked about it:

    - LLMs don’t “reason” in the symbolic, step‑by‑step sense that humans or logic engines do. They don’t manipulate abstract symbols with guaranteed consistency.
    - What they do have is a statistical prior over reasoning traces: they’ve seen millions of examples of humans doing step‑by‑step reasoning (math proofs, code walkthroughs, planning text, etc.).
    - So when you ask them to “think step by step,” they’re not deriving logic — they’re imitating the distribution of reasoning traces they’ve seen.

    This means:

    - They can often simulate reasoning well enough to be useful.
    - But they’re not guaranteed to be correct or consistent.

That at least sounds consistent with what I’ve been trying to say and what I’ve observed.