Comment by Terr_

5 days ago

... Consider this exchange:

Alice: "Bob, I know you're very proud about your neural network calculator app, but it keeps occasionally screwing up with false algebra results. There's no reason to think this new architecture will reliably do all the math we need."

Bob: "How dare you! What algebra have humans been verified to always succeed-at which my program doesn't?! Huh!? HUH!?"

___________

Bob's challenge, like yours, is not relevant. The (im)perfection of individual humans doesn't change the fact that the machine we built to do things for us is giving bad results.

It's not irrelevant, because this is an argument about whether the machine can be said to be reasoning or not.

If Alice had concluded that this occasionally-mistaken NN calculator was 'not really performing algebra', then Bob would be well within his rights to ask Alice what on earth she was going on about.

  • > If Alice had concluded that this occasionally-mistaken NN calculator was 'not really performing algebra', then Bob would be well within his rights to ask Alice what on earth she was going on about.

    No, your burden of proof here is totally bass-ackwards.

    Bob's the one who asked for blind trust that his magical auto-learning black-box would be made to adhere to certain rules... but the rules and trust are broken. Bob's the one who has to start explaining the discrepancy, and whether the failure is (A) a fixable bug, (B) an unfixable limitation that can be reliably managed, or (C) an unfixable problem with no good mitigation.

    > It's not irrelevant, because this is an argument about whether the machine can be said to be reasoning or not.

    Bringing up "b-b-but homo sapiens" is only "relevant" if you're equivocating the meaning of "reasoning", using it in a broad, philosophical, and kinda-unprovable sense.

    In contrast, the "reasoning" we actually wish LLMs would do involves capabilities like algebra, syllogisms, deduction, and the CS-classic boolean satisfiability.

    However, the track record of LLMs on such things is long and clear: they fake it, albeit impressively.

    The LLM will finish the popular 2+2=_, and we're amazed, but when we twiddle the operands too far, it gives nonsense. It answers "All men are mortal. Socrates is a man. Therefore, Socrates is ______", but reword the situation enough and it breaks again.
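
    If you want to poke at this yourself, here's a rough sketch of that kind of probe, using the openai Python client with a placeholder model name (both are just illustrative; point it at whatever model you like): it swaps fresh nouns into the same syllogism shape and checks whether the expected predicate shows up in the answer.

        import random
        from openai import OpenAI  # assumes `pip install openai` and OPENAI_API_KEY is set

        client = OpenAI()
        MODEL = "gpt-4o-mini"  # placeholder; swap in the model you want to probe

        # Same logical form as "All men are mortal...", different surface content.
        CASES = [
            ("quiblets", "perishable", "Zorn"),
            ("accountants", "licensed", "Mira"),
            ("glyptodonts", "herbivorous", "Dusty"),
        ]

        for plural, prop, name in random.sample(CASES, len(CASES)):
            prompt = (
                f"Every one of the {plural} is {prop}. "
                f"{name} is one of the {plural}. "
                f"Complete with a single word: therefore {name} is ____."
            )
            reply = client.chat.completions.create(
                model=MODEL,
                messages=[{"role": "user", "content": prompt}],
            ).choices[0].message.content
            # Crude check: does the expected predicate appear in the answer?
            ok = prop.lower() in reply.lower()
            print(f"{name}: {'OK' if ok else 'MISS'} -> {reply!r}")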

    • >Bob's the one who asked for blind trust that his magical auto-learning black-box would be made to adhere to certain rules... but the rules and trust are broken.

      This is the problem with analogies. Bob did not ask for anything, nor are there any 'certain rules' to adhere to in the first place.

      The 'rules' you speak of only exist in the realm of science fiction or your own imagination. Nowhere else is anything remotely considered a general intelligence (whether you think that's just humans or you include some of our animal friends) held to be an infallible logic automaton. It literally does not exist. Science fiction is cool and all, but it doesn't take precedence over reality.

      >Bringing up "b-b-but homo sapiens" is only "relevant" if you're equivocating the meaning of "reasoning", using it in a broad, philosophical, and kinda-unprovable sense.

      You mean the only sense that actually exists? Yes. It's also not 'unprovable' in the sense I'm asking about. Nobody has any issue answering this question for humans, or for rocks, bacteria, or a calculator. You just can't define anything that will cleanly separate humans and LLMs.

      >In contrast, the "reasoning" we actually wish LLMs would do involves capabilities like algebra, syllogisms, deduction, and the CS-classic boolean satisfiability.

      Yeah, and they're capable of doing all of those things. The best LLMs today are better than most humans at them, so again, what is Alice rambling about?

      >The LLM will finish the popular 2+2=_, and we're amazed, but when we twiddle the operands too far, it gives nonsense.

      Query GPT-5 (medium thinking) via the API with the multiplication of any random numbers you wish, up to 13 digits (I didn't bother testing higher), and watch it get the answer exactly right.
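
      Here's roughly what that check looks like, as a sketch with the openai Python client and a placeholder model name (swap in whatever model and reasoning settings you actually want to test), comparing the reply against Python's exact big-integer product:

          import random
          import re
          from openai import OpenAI  # assumes `pip install openai` and OPENAI_API_KEY is set

          client = OpenAI()
          MODEL = "gpt-4o"  # placeholder; substitute the model under test

          a = random.randrange(10**12, 10**13)  # two random 13-digit operands
          b = random.randrange(10**12, 10**13)

          reply = client.chat.completions.create(
              model=MODEL,
              messages=[{
                  "role": "user",
                  "content": f"Compute {a} * {b}. Reply with only the final integer.",
              }],
          ).choices[0].message.content

          # Strip any separators, then compare against the exact product.
          claimed = int(re.sub(r"\D", "", reply))
          print(f"{a} * {b} = {a * b}")
          print("model said:", claimed, "->", "exact match" if claimed == a * b else "WRONG")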

      Weeks ago, I got Gemini 2.5 Pro to modify the LaMa and RT-DETR architectures so I could export them to ONNX and retain the ability to run inference on dynamic input shapes. This was not a trivial exercise.
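
      (For anyone curious about the ONNX side: the export call itself just needs the spatial axes marked as dynamic, roughly like the sketch below with a stand-in model; the non-trivial part was editing the architectures so the traced graphs don't bake in fixed sizes.)

          import torch
          import torch.nn as nn

          # Stand-in model; the real work was changing LaMa / RT-DETR so the traced
          # graph has no reshapes or position grids tied to a fixed height/width.
          model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()).eval()

          dummy = torch.randn(1, 3, 256, 256)  # any valid example shape for tracing

          torch.onnx.export(
              model,
              dummy,
              "model.onnx",
              input_names=["image"],
              output_names=["features"],
              # Mark batch/height/width as dynamic so other input shapes are accepted.
              dynamic_axes={
                  "image": {0: "batch", 2: "height", 3: "width"},
                  "features": {0: "batch", 2: "height", 3: "width"},
              },
              opset_version=17,
          )

      An onnxruntime InferenceSession on the exported file will then accept other heights and widths at inference time.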

      >It answers "All men are mortal. Socrates is a man. Therefore, Socrates is ______", but reword the situation enough and it breaks again.

      Do you actually have an example of a rewording that SOTA models fail at?
