
Comment by AlphaAndOmega0

17 hours ago

Previous models from competitors usually got that correct, and the reasoning versions almost always did.

This kind of reflexive criticism isn't helpful; it's closer to a fully generalized counter-argument against LLM progress, when it's obvious to anyone that models today can do things they couldn't do six months ago, let alone two years back.

I'm not denying any progress. I'm saying that simple reasoning failures that have gone viral are exactly the kind of thing they will toss into the training data. Why wouldn't they? There are real reputational risks in not fixing it and no costs in fixing it.

  • Given that Gemini 3 Pro already did well on that test, what exactly did they improve? Why would they bother?

    I double-checked and tested on AI Studio, since you can still access the previous model there:

    > You should drive.
    > If you walk there, your car will stay behind, and you won't be able to wash it.

    Thinking models consistently get it correct, and did so when the test was brand new (a week or two ago). It is the opposite of surprising that a new thinking model continues getting it correct, unless the competitors had a time machine.

    • Why would they bother? Because it costs essentially nothing to add it to the training data. My point is that once a reasoning example becomes sufficiently viral, it ceases to be a good test, because companies have a massive incentive to correct it. The fact that some models got it right before (unreliably) doesn't mean they wouldn't want to ensure the model gets it right.