Comment by og_kalu

2 days ago

1. What you're generally describing is a well-known failure mode for humans as well. Even when it "failed" the riddle tests, substituting the words or morphing the question so it didn't look like a replica of the famous problem usually did the trick. I'm not sure what your point is, because you can play this gotcha on humans too.

2. You just demonstrated GPT-5 has 99.9% accuracy on unforeseen 15-digit multiplication and your conclusion is "fancy pattern matching"? Really? Well, I'm not sure you could do better, so your example isn't really doing what you hoped for.

Humans can break things down and work through them step by step. The LLMs one-shot pattern match. Even the reasoning models have been shown to do just that. Anthropic even showed that the reasoning models tended to work backwards: one-shotting an answer and then matching a chain of thought to it after the fact.

If a human is capable of multiplying double-digit numbers, they can also multiply those large ones. The steps are the same, just repeated many more times. So by learning the steps of long multiplication, you can multiply any two numbers with enough patience. The LLM doesn’t scale like this, because it’s not doing the steps. That’s my point.
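
To make that concrete, here is the schoolbook procedure written out as a toy Python sketch (purely illustrative; the function name and inputs are made up for the example, and this says nothing about how an LLM works internally):

    # Schoolbook long multiplication: the same digit-by-digit steps you'd do
    # on paper, just repeated more times for longer numbers.
    def long_multiply(a: str, b: str) -> str:
        da = [int(d) for d in reversed(a)]
        db = [int(d) for d in reversed(b)]
        result = [0] * (len(da) + len(db))
        for i, x in enumerate(da):
            carry = 0
            for j, y in enumerate(db):
                total = result[i + j] + x * y + carry
                result[i + j] = total % 10
                carry = total // 10
            result[i + len(db)] += carry
        # Put digits back in normal order and drop any leading zero.
        return "".join(map(str, reversed(result))).lstrip("0") or "0"

    # The same loop handles 2-digit and 15-digit inputs; it just runs longer.
    assert long_multiply("12", "34") == str(12 * 34)
    assert long_multiply("123456789012345", "987654321098765") == str(123456789012345 * 987654321098765)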

A human doesn’t need to have seen those 15-digit numbers before to be able to multiply them, because a human can follow the procedure. GPT’s answer was orders of magnitude off. It resembles the right answer superficially but it’s a very different result.

The same applies to the riddles. A human can apply logical steps. The LLM either knows or it doesn’t.

Maybe my examples weren’t the best. I’m sorry for not articulating it better, but I see this daily as I interact with AI: it has a superficial “understanding” where, if what I ask happens to be close to something it’s trained on, it gets good results, but it has no critical thinking, no step-by-step reasoning (even the “reasoning models”), and it repeats the same mistakes even when explicitly told up front not to make them.

  • >Humans can break things down and work through them step by step. The LLMs one-shot pattern match.

    I've had LLMs break down problems and work through them, pivot when errors arise, and all that jazz. They're not perfect at it and they're worse than humans, but it happens.

    >Anthropic even showed that the reasoning models tended to work backwards: one-shotting an answer and then matching a chain of thought to it after the fact.

    This is another failure mode that also occurs in humans. A number of experiments suggest human explanations are often post hoc rationalizations, even when people genuinely believe otherwise.

    >If a human is capable of multiplying double-digit numbers, they can also multiply those large ones.

    Yeah, and some of them will make mistakes, and some of them will be less accurate than GPT-5. We didn't switch to calculators and spreadsheets just for the fun of it.

    >GPT’s answer was orders of magnitude off. It resembles the right answer superficially but it’s a very different result.

    GPT-5 on the site is a router that will give you who knows what model, so I tried your query with the API directly (GPT-5, medium thinking) and it gave me:

    9.207337461477596e+27

    When prompted to give all the numbers, it returned:

    9,207,337,461,477,596,127,977,612,004.

    You can replicate this if you use the API. Honestly, I'm surprised; I didn't realize the state of the art had become this precise.
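
    For reference, the call is roughly this (a sketch only: the model name and reasoning parameter follow my understanding of the Python SDK's Responses API, and the original 15-digit operands aren't reproduced here):

        # Sketch of querying GPT-5 with medium reasoning effort via the API.
        # Assumes the openai Python SDK's Responses API; adjust if your SDK
        # version exposes this differently. Requires OPENAI_API_KEY to be set.
        from openai import OpenAI

        client = OpenAI()

        resp = client.responses.create(
            model="gpt-5",
            reasoning={"effort": "medium"},
            # <a> and <b> are placeholders for the two 15-digit operands.
            input="Multiply these two 15-digit numbers exactly: <a> * <b>",
        )
        print(resp.output_text)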

    Now what? Does this prove you wrong?

    This is kind of the problem. There's no sense in making gross generalizations, especially based on behavior that also manifests in humans.

    LLMs don't understand some things well. Why not leave it at that?

    • Here is how GPT itself described LLM reasoning when I asked about it:

          - LLMs don’t “reason” in the symbolic, step‑by‑step sense that humans or logic engines do. They don’t manipulate abstract symbols with guaranteed consistency.
          - What they do have is a statistical prior over reasoning traces: they’ve seen millions of examples of humans doing step‑by‑step reasoning (math proofs, code walkthroughs, planning text, etc.).
          - So when you ask them to “think step by step,” they’re not deriving logic — they’re imitating the distribution of reasoning traces they’ve seen.
      
          This means:
      
          - They can often simulate reasoning well enough to be useful.
          - But they’re not guaranteed to be correct or consistent.
      

      That at least sounds consistent with what I’ve been trying to say and what I’ve observed.