
Comment by danpalmer

2 months ago

Gemini 2.5 Flash got it right for me the first time.

It’s just a few anecdotes, not data, but that’s two examples of first-time correctness, so it certainly doesn’t seem like luck. If you have more general testing data on this, I’m keen to see the results and methodology, though.

Throwing a pair of dice and getting exactly 2 can also happen on the first try. That doesn't mean the dice are a 1+1 calculating machine.

  • I guess my point is that the parent comment says LLMs get this wrong but presents no evidence for that, while two anecdotes disagree. The next step would be to see some evidence to the contrary.

    • > LLMs get this wrong

      I wrote that «a dozen models, no one could count», and that covered all the models I tried, with reasoning enabled or not.

      > presents no evidence

      Set up an environment to test it and look for the failures: a system prompt along the lines of "count this, this and that in the input", a user prompt of some short paragraph, and the latest open-weights models. A minimal harness is sketched below.
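
      Here is a minimal sketch of such a harness, assuming a local OpenAI-compatible endpoint (llama.cpp or Ollama style) and a placeholder model name; the URL, model name, and exact prompt wording are my own assumptions, not part of the setup described above:

      ```python
      import re
      import requests

      # Assumed local OpenAI-compatible server (e.g. llama.cpp, Ollama) and a
      # hypothetical model name; substitute whatever open-weights model you run.
      ENDPOINT = "http://localhost:8080/v1/chat/completions"
      MODEL = "your-open-weights-model"

      paragraph = ("The quick brown fox jumps over the lazy dog. "
                   "The dog sleeps while the fox runs. The end.")
      target = "the"

      # Ground truth: whole-word, case-insensitive count.
      truth = len(re.findall(rf"\b{re.escape(target)}\b", paragraph, re.IGNORECASE))

      resp = requests.post(ENDPOINT, json={
          "model": MODEL,
          "messages": [
              {"role": "system",
               "content": f"Count the occurrences of the word '{target}' "
                          "in the input. Reply with the number only."},
              {"role": "user", "content": paragraph},
          ],
          "temperature": 0,
      })
      answer = resp.json()["choices"][0]["message"]["content"].strip()

      print(f"model said {answer!r}, ground truth {truth}: "
            f"{'PASS' if answer == str(truth) else 'FAIL'}")
      ```

      Run that across the models you care about and the failures show up quickly; temperature 0 keeps the runs comparable across models.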

      > two anecdotes disagree

      There is a strong asymmetry between verification and falsification. The falsification here occurred across the full set of selected LLMs, which is a lot of models. If two classes of models exist, the failing class is numerous, and the difference between the two needs to be pinned down clearly, all the more so because we expect the failure to carry over beyond the case of counting.