Comment by vrighter
2 months ago
Throwing a pair of dice and getting exactly 2 can also happen on the first try. That doesn't mean the dice are a 1+1 calculating machine.
I guess my point is that the parent comment says LLMs get this wrong, but presents no evidence for that, and two anecdotes disagree. The next step is to see some evidence to the contrary.
> LLMs get this wrong
What I wrote was that of «a dozen models, no one could count». That means every model I tried, with or without reasoning.
> presents no evidence
Set up a test environment and look for the failures yourself: a system prompt along the lines of "count this, this and that in the input", a user prompt containing some short paragraph, and, for models, the latest open-weights releases.
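For concreteness, a rough sketch of such a harness, not the exact setup I used. The three things counted (words, sentences, the letter "e"), the answer format, and the `ask_model` callable are all assumptions made for illustration; `ask_model` stands in for whatever client you use to query a model and is not a real library function.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical system prompt: the items to count are illustrative choices.
SYSTEM_PROMPT = (
    "Count the words, the sentences, and the occurrences of the letter 'e' "
    "in the input. Answer as: words=<n> sentences=<n> letter_e=<n>"
)

def ground_truth(text: str) -> Dict[str, int]:
    """Deterministic reference counts to compare the model against."""
    return {
        "words": len(text.split()),                      # whitespace-separated tokens
        "sentences": len(re.findall(r"[.!?]+", text)),   # runs of end punctuation
        "letter_e": text.lower().count("e"),
    }

def parse_answer(reply: str) -> Dict[str, int]:
    """Extract key=<integer> pairs from the model's reply."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", reply)}

def failures(
    ask_model: Callable[[str, str], str],  # assumed signature: (system, user) -> reply
    paragraphs: List[str],
) -> List[Tuple[str, Dict[str, int], Dict[str, int]]]:
    """Return (paragraph, expected, got) for every paragraph the model miscounts."""
    failed = []
    for paragraph in paragraphs:
        expected = ground_truth(paragraph)
        got = parse_answer(ask_model(SYSTEM_PROMPT, paragraph))
        if got != expected:
            failed.append((paragraph, expected, got))
    return failed
```

Run `failures` once per model over the same set of paragraphs and compare the lengths of the returned lists. A single correct answer proves little; the failure rate over many inputs is what matters.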
> two anecdotes disagree
There is a strong asymmetry between verification and falsification. The falsification occurred across a full set of selected LLMs, which is a lot. If there are two classes of models, the failing class is numerous, and the difference between the two must be pointed out clearly, all the more so since we believe the failure will carry over beyond the case of counting.