Comment by kgeist
2 months ago
Did you enable reasoning? Qwen3 32b with reasoning enabled gave me the correct answer on the first attempt.
> Did you enable reasoning
Yep.
> gave me the correct answer
Try real-world tests that cannot be covered by training data or chancey guesses.
Counting letters is a known blind spot in LLMs because of how tokenization works: most models never see individual letters. I'm not sure it's a valid test for drawing any far-reaching conclusions about their intelligence. It's like calling a blind person an absolute dumbass just because they can't tell green from red.
The fact that reasoning models can count letters, even though they can't see individual letters, is actually pretty cool.
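A rough sketch of why letter counting is hard under tokenization. The vocabulary and token IDs below are invented for illustration (real BPE merges differ), but the point stands: the model receives token IDs, not characters.

```python
# Toy illustration: a subword tokenizer hides individual letters.
# TOY_VOCAB and its IDs are made up for this example.
TOY_VOCAB = {"blue": 17, "berry": 42}

def toy_tokenize(word):
    """Greedy longest-match split over the toy vocabulary."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in TOY_VOCAB:
                tokens.append(TOY_VOCAB[word[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token covers {word[i:]!r}")
    return tokens

# The model "sees" [17, 42], not nine characters, so "how many
# b's?" has no direct answer anywhere in its input.
print(toy_tokenize("blueberry"))   # [17, 42]
print("blueberry".count("b"))      # character-level answer: 2
```

A reasoning model that gets this right has to reconstruct the spelling from what it memorized about the word, which is why it is genuinely a nontrivial feat.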
>Try real-world tests that cannot be covered by training data
If we don't allow a model to base its reasoning on the training data it's seen, what should it base it on? Clairvoyance? :)
> chancey guesses
The default sampling in most LLMs uses randomness to feel less robotic and repetitive, so it’s no surprise it makes “chancey guesses.” That’s literally what the system is programmed to do by default.
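That default randomness is just temperature sampling over the model's output distribution. A minimal self-contained sketch (plain softmax sampling, not any particular vendor's implementation):

```python
import math, random

def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Softmax sampling: temperature > 0 injects the randomness the
    comment mentions; temperature -> 0 approaches greedy argmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1

logits = [2.0, 1.0, 0.1]
# With a tiny temperature the sampler is effectively deterministic:
print(sample_with_temperature(logits, temperature=1e-6))  # 0 (argmax)
```

At temperature 1.0 the same call returns 0, 1, or 2 with probabilities proportional to exp(logit), which is exactly the "chancey guess" behavior being described.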
> they don't see individual letters
Yet they seem to, judging from many other tests (character correction or manipulation in texts, for example).
> The fact that reasoning models can count letters, even though they can't see individual letters
To a mind, every idea is a representation. But we want the processor to work reliably on those representations.
> If we don't allow a [mind] to base its reasoning on the training data it's seen, what should it base it on
On its reasoning and judgement over what it was told. You do not repeat what you heard, or you state that's what you heard (and provide sources).
> uses randomness
That is in a way a problem, a non-final fix: satisficing (Herb Simon) over random seeds instead of constructing a solution through a full optimality plan.
In the way I used the expression «chancey guesses» though I meant that guessing by chance when the right answer falls in a limited set ("how many letters in 'but'") is a weaker corroboration than when the right answer falls in a richer set ("how many letters in this sentence").
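That distinction can be made numeric: a uniform random guess over a small plausible answer set succeeds fairly often, while over a rich set it almost never does. A back-of-the-envelope sketch (the answer-set sizes are my own illustrative assumptions):

```python
# Chance that a uniform random guess is right, as a function of how
# many answers are plausible a priori.
def chance_of_lucky_guess(n_plausible_answers):
    return 1.0 / n_plausible_answers

# "How many letters in 'but'?" - say ~5 plausible answers (1..5):
print(chance_of_lucky_guess(5))    # 0.2
# "How many letters in this sentence?" - say ~50 plausible answers:
print(chance_of_lucky_guess(50))   # 0.02
```

So a correct answer on the long question is roughly ten times stronger corroboration than on the short one, under these assumptions.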
2 replies →
The 2b Granite model can do this on the first attempt:
  ollama run hf.co/ibm-granite/granite-3.3-2b-instruct-GGUF:F16
  >>> how many b’s are there in blueberry?
  The word "blueberry" contains two 'b's.
I did include Granite (8b) in the tests I mentioned. You suggest granite-3.3-2b-instruct; no problem.
response:
All wrong.
Sorry, I did not have the "F16" quantization available.
So did DeepSeek. I guess the Chinese have figured out something the West hasn't: how to count.
No, DeepSeek also fails. (It worked in your test; it failed in similar others.)
(And note that DeepSeek can be very dumb: in practice, as experienced in our own use, and in standard tests, where it shows an ~80 IQ, whereas with other tools we achieved ~120 IQ (trackingai.org). DeepSeek was an important step, a demonstration of potential for efficiency, a gift - but it is still part of the collective work in progress.)