
Comment by dkersten

2 days ago

You can trivially demonstrate that it's just a very complex and fancy pattern matcher: "if prompt looks something like this, then response looks something like that".

You can demonstrate this by, e.g., asking it mathematical questions. If it's seen them before, or something similar enough, it'll give you the correct answer; if it hasn't, it gives you a right-ish-looking yet incorrect answer.

For example, I just did this on GPT-5:

    Me: what is 435 multiplied by 573?
    GPT-5: 435 x 573 = 249,255

This is correct. But now let's try it with numbers it's very unlikely to have seen before:

    Me: what is 102492524193282 multiplied by 89834234583922?
    GPT-5: 102492524193282 x 89834234583922 = 9,205,626,075,852,076,980,972,804

This is not the correct answer, but it looks quite similar to it. Here is GPT's answer (first) and the actual correct answer (second):

    9,205,626,075,852,076,980,972,804
    9,207,337,461,477,596,127,977,612,004

They sure look kinda similar when lined up like that; some of the digits even match up. But they're very, very different numbers: GPT's answer has 25 digits while the correct one has 28, so it's off by roughly a factor of a thousand.
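
If you want to check the arithmetic yourself, Python's integers are arbitrary precision, so a two-line sanity check is enough (this is just a way to verify the correct product, not a claim about what GPT does internally):

    a = 102492524193282
    b = 89834234583922
    print(f"{a * b:,}")  # 9,207,337,461,477,596,127,977,612,004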

So it's trivially not "real thinking", because it's just an "if this then that" pattern matcher. A very sophisticated one that can do incredible things, but a pattern matcher nonetheless. There's no reasoning, no step-by-step application of logic, even when it does chain of thought.

To give it the best chance, I asked it the second one again, but this time asked it to show me the step-by-step process. It broke the problem into steps and produced a different, yet still incorrect, result:

    9,205,626,075,852,076,980,972,704

Now, I know that LLMs are language models, not calculators; this is just a simple example that's easy to try out. I've seen similar things with coding: it can produce things it's likely to have seen, but it struggles with things that are logically relatively simple yet unlikely to have appeared in its training data.

Another example: purposely butcher that riddle about the doctor/surgeon being the person's mother and ask it incorrectly, e.g.:

    A child was in an accident. The surgeon refuses to treat him because he hates him. Why?

The LLMs I've tried it on all respond with some variation of "The surgeon is the boy’s father." A correct answer would be that there isn't enough information to know.

They're for sure getting better at matching things. E.g., if you ask the river crossing riddle but replace the animals with abstract variables, it does tend to get it right now (it didn't in the past). But if you add a few more degrees of separation, making the riddle semantically the same but harder to "see", it takes coaxing to get it to correctly step through to the right answer.

1. What you're describing is a well-known failure mode for humans as well. Even when it "failed" the riddle tests, substituting the words or morphing the question so it didn't look like a replica of the famous problem usually did the trick. I'm not sure what your point is, because you can play this gotcha on humans too.

2. You just demonstrated GPT-5 has 99.9% accuracy on unforeseen 15-digit multiplication, and your conclusion is "fancy pattern matching"? Really? Well, I'm not sure you could do better, so your example isn't really doing what you hoped for.

  • Humans can break things down and work through them step by step. The LLMs one-shot pattern match. Even the reasoning models have been shown to do just that. Anthropic even showed that the reasoning models tended to work backwards: one-shotting an answer and then matching a chain of thought to it after the fact.

    If a human is capable of multiplying double-digit numbers, they can also multiply those large ones. The steps are the same, just repeated many more times. So by learning the steps of long multiplication, you can multiply any numbers with enough patience (see the sketch at the end of this comment). The LLM doesn’t scale like this, because it’s not doing the steps. That’s my point.

    A human doesn’t need to have seen those 15-digit numbers before to be able to multiply them, because a human can follow the procedure. GPT’s answer was orders of magnitude off. It resembles the right answer superficially but it’s a very different result.

    The same applies to the riddles. A human can apply logical steps. The LLM either knows or it doesn’t.

    Maybe my examples weren’t the best, and I’m sorry for not articulating it better, but I see this daily as I interact with AI: it has a superficial “understanding” where, if what I ask happens to be close to something it’s trained on, it gets good results, but it has no critical thinking, no step-by-step reasoning (even in the “reasoning models”), and it repeats the same mistakes even when explicitly told up front not to make them.
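
    To make the “same steps, just more of them” point concrete, here is a rough Python sketch of schoolbook long multiplication, working digit by digit the way a person does on paper. It’s just an illustration of the procedure, not a claim about how any model works internally:

        def long_multiply(x: str, y: str) -> str:
            # schoolbook long multiplication on digit strings
            result = [0] * (len(x) + len(y))
            for i, dx in enumerate(reversed(x)):
                carry = 0
                for j, dy in enumerate(reversed(y)):
                    total = result[i + j] + int(dx) * int(dy) + carry
                    result[i + j] = total % 10   # write this column's digit
                    carry = total // 10          # carry into the next column
                result[i + len(y)] += carry
            return ''.join(map(str, reversed(result))).lstrip('0') or '0'

        print(long_multiply("102492524193282", "89834234583922"))
        # 9207337461477596127977612004

    The procedure never changes with the size of the inputs; only the number of digit-by-digit steps grows.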

    • >Humans can break things down and work through them step by step. The LLMs one-shot pattern match.

      I've had LLMs break down problems and work through them, pivot when errors arise, and all that jazz. They're not perfect at it, and they're worse than humans, but it happens.

      >Anthropic even showed that the reasoning models tended to work backwards: one-shotting an answer and then matching a chain of thought to it after the fact.

      This is another failure mode that also occurs in humans. A number of experiments suggest human explanations are often post hoc rationalizations, even when people genuinely believe otherwise.

      >If a human is capable of multiplying double-digit numbers, they can also multiply those large ones.

      Yeah, and some of them will make mistakes, and some of them will be less accurate than GPT-5. We didn't switch to calculators and spreadsheets just for the fun of it.

      >GPT’s answer was orders of magnitude off. It resembles the right answer superficially but it’s a very different result.

      GPT-5 on the site is a router that will give you who knows what model, so I tried your query with the API directly (GPT-5, medium thinking) and it gave me:

      9.207337461477596e+27

      When prompted to give all the digits, it returned:

      9,207,337,461,477,596,127,977,612,004.

      You can replicate this if you use the API. Honestly, I'm surprised; I didn't realize the state of the art had become this precise.
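
      For anyone who wants to reproduce it, something along these lines with the OpenAI Python SDK should do. The exact model name and the reasoning-effort setting are assumptions on my part, so adjust them to whatever the API currently exposes for you:

          from openai import OpenAI

          client = OpenAI()  # reads OPENAI_API_KEY from the environment

          # "gpt-5" and reasoning_effort="medium" are assumptions; swap in
          # whatever model / reasoning settings you actually have access to
          resp = client.chat.completions.create(
              model="gpt-5",
              reasoning_effort="medium",
              messages=[{
                  "role": "user",
                  "content": "what is 102492524193282 multiplied by 89834234583922?",
              }],
          )
          print(resp.choices[0].message.content)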

      Now what? Does this prove you wrong?

      This is kind of the problem. There's no sense in making gross generalizations, especially based on behavior that also manifests in humans.

      LLMs don't understand some things well. Why not leave it at that?
