Comment by ares623

2 days ago

Sorry, not directed at you specifically. But every time I see questions like this I can’t help but rephrase in my head:

“Did you try running it over and over until you got the results you wanted?”

This is not a good analogy because reasoning models are not choosing the best from a set of attempts based on knowledge of the correct answer. It really is more like what it sounds like: “did you think about it longer until you ruled out various doubts and became more confident?” Of course nobody knows quite why directing more computation in this way makes them better, and nobody seems to take the reasoning trace too seriously as a record of what is happening. But it is clear that it works!
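
To make the distinction concrete: "running it over and over" is best-of-n sampling, where picking the best answer happens outside the model, while "thinking longer" is a single call with a larger reasoning budget. A minimal Python sketch of the two patterns, assuming the OpenAI Python SDK's Responses API and the gpt-5 model name (the parameter choices are my illustration, not anything reported in this thread):

    from openai import OpenAI

    client = OpenAI()
    question = "<some hard question>"

    # "Run it over and over": several independent low-effort samples; any
    # selection of the best answer happens outside the model, by a person
    # or a separate grader that knows what a good answer looks like.
    attempts = [
        client.responses.create(
            model="gpt-5", input=question, reasoning={"effort": "low"}
        )
        for _ in range(5)
    ]

    # "Think about it longer": one request with a higher reasoning effort; the
    # extra compute goes into a single, longer hidden reasoning trace, with no
    # external pick-the-best step at the end.
    deliberate = client.responses.create(
        model="gpt-5", input=question, reasoning={"effort": "high"}
    )
    print(deliberate.output_text)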

  • > Of course nobody knows quite why directing more computation in this way makes them better, and nobody seems to take the reasoning trace too seriously as a record of what is happening. But it is clear that it works!

    One thing I find hard to wrap my head around is that we are giving more and more trust to something we don't understand, on the assumption (often unchecked) that it just works. Basically, your refrain is used to justify all sorts of odd setups of AIs, agents, etc.

    • Trusting things to work based on practical experience, without formal verification, is the norm rather than the exception. In formal contexts like software development, people have the means to evaluate and to exercise good judgment.

      I am much more worried about the problem where LLMs are actively misleading low-info users into thinking they’re people, especially children and old people.

  • Bad news: it doesn't seem to work as well as you might think: https://arxiv.org/pdf/2508.01191

    As one might expect: the AI isn't actually thinking, it's just spending more tokens on the problem. That sometimes leads to the desired outcome, but the effect is brittle and disappears when the AI is pushed outside the bounds of its training.

    To quote their discussion, "CoT is not a mechanism for genuine logical inference but rather a sophisticated form of structured pattern matching, fundamentally bounded by the data distribution seen during training. When pushed even slightly beyond this distribution, its performance degrades significantly, exposing the superficial nature of the “reasoning” it produces."

    • I keep wondering whether people have actually examined how this work draws its conclusions before citing it.

      This is science at its worst, where you start at an inflammatory conclusion and work backwards. There is nothing particularly novel presented here, especially not in the mathematics; obviously performance will degrade on out-of-distribution tasks (and will do so for humans under the same formulation), but the real question is how out-of-distribution a lot of tasks actually are if they can still be solved with CoT. Yes, if you restrict the dataset, then it will perform poorly. But humans already have a pretty large visual dataset to pull from, so what are we comparing to here? How do tiny language models trained on small amounts of data demonstrate fundamental limitations?

      I'm eager to see more work showing the limitations of LLM reasoning, both at small and large scale, but this ain't it. Others have already supplied similar critiques, so let's please stop sharing this one around without a grain of salt.


    • A plausible consideration is that the models home in on and reinforce an incorrect answer - a natural side effect of LLM technology pushing certain answers higher in probability and repeating whatever is already in context.

      Whether this happens in a conversation or in a thinking context, nothing prevents the model from committing to the wrong answer, so the paper on the illusion of thinking makes sense.

      What actually seems to be happening is a form of conversational prompting. With the right back-and-forth you can inject knowledge in a way that shifts the model's natural distribution (again, a side effect of the LLM tech), but left to itself it won't get the answer right every time.

      If this extended thinking were actually working, you would expect the LLM to be able to reason its way to the correct answer with very high accuracy, which it does not.

    • The other commenter is more articulate, but you simply cannot draw the conclusion from this paper that reasoning models don't work well. They trained tiny little models and showed they don't work. Big surprise! Meanwhile every other piece of evidence available shows that reasoning models are more reliable at sophisticated problems. Just a few examples:

      - https://arcprize.org/leaderboard

      - https://aider.chat/docs/leaderboards/

      - https://arstechnica.com/ai/2025/07/google-deepmind-earns-gol...

      Surely the IMO problems weren't "within the bounds" of Gemini's training data.


    • They experimented with GPT-2-scale models. It's hard to draw meaningful conclusions from that in the GPT-5 era.

    • This is not the slam dunk you think it is. Thinking longer genuinely improves accuracy, though sure, there are diminishing returns to adding more thinking tokens.

      GPT-5 fast gets many things wrong, but switching to the thinking model very often fixes the issues.

What you describe is a person selecting the best result, but if you can get better results in a single shot with that option enabled, it’s worth testing and reporting what you find.

  • I get that. But then if that option doesn't help, what I've seen is that the next followup is inevitably "have you tried doing/prompting x instead of y"

    • It can be summarized as "Did you RTFM?". One shouldn't expect optimal results without investing the time and effort to learn the tool, any tool. LLMs are no different. GPT-5 isn't one model, it's a family: gpt-5, gpt-5-mini, gpt-5-nano, each taking high|medium|low reasoning-effort configurations. Anyone who is serious about measuring model capability would go for the best configuration, especially in medicine.

      I skimmed through the paper and didn't see any mention of which parameters they used, other than that they accessed gpt-5 via the API.

      What was the reasoning_effort? verbosity? temperature?

      These things matter.
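
      For concreteness, here's a minimal Python sketch of where those knobs live, assuming the OpenAI Python SDK's Responses API (the parameter shapes and model names are my assumptions about the public API, not anything the paper reports):

          from openai import OpenAI

          client = OpenAI()

          # Each of these choices can noticeably change benchmark results.
          response = client.responses.create(
              model="gpt-5",                 # vs. gpt-5-mini / gpt-5-nano
              reasoning={"effort": "high"},  # vs. "medium" / "low"
              text={"verbosity": "low"},     # assumption: GPT-5's verbosity control
              input="<benchmark question goes here>",
          )
          # Note: classic sampling knobs like temperature may be ignored or
          # rejected by reasoning models, which is itself worth reporting.
          print(response.output_text)

      Without these settings reported, the results are hard to reproduce or compare against.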

    • Something I've experienced with multiple new model releases is that plugging them into my app makes it worse. Then I do a bunch of work on prompts and the app is better than ever. And it's not that the new prompts are simply better and make the old model work better too - usually the new prompts make the old model worse, or there's no change.

      So it makes sense to me that you should keep trying until you get the results you want (or conclude that you can't). And it makes sense to ask people what they've tried. I haven't done this work for GPT-5 yet and am not that optimistic, but it's possible it will turn out this way again.
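
      In practice that work ends up looking like prompts keyed by model rather than one prompt shared across all of them. A small illustrative Python sketch of that idea (model names and prompt text are made up, not from my actual app):

          # Prompts tuned per model, since a prompt that helps one model
          # can make another one worse.
          SYSTEM_PROMPTS = {
              "gpt-4o": "Answer tersely. Ask a clarifying question if the request is ambiguous.",
              "gpt-5": "Answer directly. Call the provided tools before guessing.",
          }

          def build_messages(model: str, user_message: str) -> list[dict]:
              # Fall back to an older tuned prompt while a new model is still being evaluated.
              system = SYSTEM_PROMPTS.get(model, SYSTEM_PROMPTS["gpt-4o"])
              return [
                  {"role": "system", "content": system},
                  {"role": "user", "content": user_message},
              ]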

    • > I get that. But then if that option doesn't help, what I've seen is that the next followup is inevitably "have you tried doing/prompting x instead of y"

      Maybe I’m misunderstanding, but it sounds like you’re framing a completely normal process (try, fail, adjust) as if it’s unreasonable?

      In reality, when something doesn’t work, the obvious next step is to adapt and try again. That doesn’t seem like a radical approach; it’s largely just how problem solving works.

      For example, when I was a kid trying to push-start my motorcycle, it wouldn’t fire no matter what I did. Someone suggested a simple tweak: try a different gear. I did, and the bike instantly roared to life. What I was doing wasn’t wrong, it just needed a slight adjustment to get the result I was after.
