Comment by mdp2021
2 months ago
I ran this test extensively a few days ago, on a dozen models: not one could count - all of them got the results wrong, and all of them suggested they can't check and will just guess.
Until they are capable of procedural thinking, they will be radically, structurally unreliable. Structurally delirious.
And it is also a good thing that we can check in such an easy way - if the producers only patched this local fault, the absence of procedural thinking would not be apparent, and we would need more sophisticated ways to check for it.
If you think about the architecture, how is a decoder transformer supposed to count? It is not magic. The weights must implement some algorithm.
Take a task where a long paragraph contains the word "blueberry" multiple times, and at the end, a question asks how many times blueberry appears. If you tried to solve this in one shot by attending to every "blueberry," you would only get an averaged value vector for matching keys, which is useless for counting.
To count, the QKV mechanism, the only source of horizontal information flow, would need to accumulate a value across tokens. But since the question is only appended at the end, the model would have to decide in advance to accumulate "blueberry" counts and store them in the KV cache. This would require layer-wise accumulation, likely via some form of tree reduction.
Even then, why would the model maintain this running count for every possible question it might be asked? The potential number of such questions is effectively limitless.
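As a toy illustration of the averaging point (my own sketch in plain numpy, with made-up vectors - not a claim about any particular model): softmax attention over N matching keys returns a convex combination of their value vectors, so the output is essentially the same whether "blueberry" occurs 2 times or 20.

  import numpy as np

  def attend(query, keys, values):
      scores = keys @ query                            # dot-product attention scores
      weights = np.exp(scores) / np.exp(scores).sum()  # softmax -> weights sum to 1
      return weights @ values                          # weighted *average* of values

  rng = np.random.default_rng(0)
  d = 8
  q = rng.normal(size=d)       # query emitted by the question token
  v_blue = rng.normal(size=d)  # hypothetical value vector for "blueberry"
  k_blue = q                   # keys that match the query strongly

  for n in (2, 5, 20):         # the paragraph contains n occurrences
      out = attend(q, np.tile(k_blue, (n, 1)), np.tile(v_blue, (n, 1)))
      print(n, np.round(out[:3], 4))  # identical output regardless of n

Nothing in a single attention pass makes n visible in that output; a count would have to be accumulated across tokens and layers, which is exactly the problem described above.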
Did you enable reasoning? Qwen3 32b with reasoning enabled gave me the correct answer on the first attempt.
> Did you enable reasoning
Yep.
> gave me the correct answer
Try real-world tests that cannot be covered by training data or chancy guesses.
Counting letters is a known blind spot for LLMs because of how tokenization works in most of them - they don't see individual letters. I'm not sure it's a valid test for drawing any far-reaching conclusions about their intelligence. It's like saying a blind person is an absolute dumbass just because they can't tell green from red.
The fact that reasoning models can count letters, even though they can't see individual letters, is actually pretty cool.
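(To see what the model actually receives, you can inspect the tokenization - a quick sketch assuming the tiktoken package; the exact split depends on the tokenizer, so treat the output as illustrative.)

  import tiktoken

  enc = tiktoken.get_encoding("cl100k_base")
  ids = enc.encode("blueberry")
  print(ids)                             # a handful of subword token IDs
  print([enc.decode([i]) for i in ids])  # subword pieces, not individual letters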
>Try real-world tests that cannot be covered by training data
If we don't allow a model to base its reasoning on the training data it's seen, what should it base it on? Clairvoyance? :)
> chancy guesses
The default sampling in most LLMs uses randomness to feel less robotic and repetitive, so it's no surprise that it makes "chancy guesses." That's literally what the system is programmed to do by default.
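(A minimal sketch of what "default sampling" means here, with toy logits and an illustrative temperature value: the next token is drawn from the softmax distribution rather than taken greedily, so repeated runs can answer differently.)

  import numpy as np

  rng = np.random.default_rng()

  def sample_next(logits, temperature=0.8):
      p = np.exp(np.array(logits) / temperature)
      p /= p.sum()
      return int(rng.choice(len(p), p=p))  # stochastic: varies run to run

  def greedy_next(logits):
      return int(np.argmax(logits))        # deterministic: always the top token

  logits = [2.0, 1.6, 0.3]                 # toy scores for three candidate tokens
  print([sample_next(logits) for _ in range(10)])
  print(greedy_next(logits))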
The 2B Granite model can do this on the first attempt:
  ollama run hf.co/ibm-granite/granite-3.3-2b-instruct-GGUF:F16
  >>> how many b's are there in blueberry?
  The word "blueberry" contains two 'b's.
I did include Granite (8B) in the tests I mentioned. You suggest granite-3.3-2b-instruct - no problem.
response:
All wrong.
Sorry, I did not have the "F16" available.
So did DeepSeek. I guess the Chinese have figured out something the West hasn't: how to count.
No, DeepSeek also fails. (It worked in your test - it failed in similar ones.)
(And note that DeepSeek can be very dumb - in practice, as experienced in our own use, and in standard tests, where it shows an IQ of ~80, whereas with other tools we achieved ~120 (trackingai.org). DeepSeek was an important step, a demonstration of the potential for efficiency, a gift - but it is still part of the collective work in progress.)
https://claude.ai/share/e7fc2ea5-95a3-4a96-b0fa-c869fa8926e8
It's really not hard to get them to reach the correct answer on this class of problems. Want me to have it spell it backwards and strip out the vowels? I'll be surprised if you can find an example this model can't one-shot.
(I can't see it now because of maintenance, but of course I trust it - that some get it right is not the issue.)
> if you can find an example this model can't
Then we have the problem of understanding why some work and some do not, and the crucial due-diligence problem of determining whether the class of issues revealed by the failures of so many models is fully overcome in the architectures of those that work, or whether the boundaries of the problem have merely been moved and still taint other classes of results.
Gemini 2.5 Flash got it right for me the first time.
It's just a few anecdotes, not data, but that's two examples of first-time correctness, so it certainly doesn't seem like luck. If you have more general testing data on this, I'm keen to see the results and methodology, though.
Throwing a pair of dice and getting exactly 2 can also happen on the first try. That doesn't mean the dice are a 1+1 calculating machine.
I guess my point is that the parent comment says LLMs get this wrong, but presents no evidence for that, and two anecdotes disagree. The next step is to see some evidence to the contrary.
I tested it the other day, and Claude with reasoning got it correct every time.
The interesting point is that many fail (100% in the class I had to select). That raises the question of what differentiates the pass class from the fail class, and the even more important question of whether the solution inside the pass class is contextual or definitive.