Can modern LLMs count the number of b's in "blueberry"?

2 months ago (minimaxir.com)

The Internet has been poking fun at LLMs failing to count letters for many months. Are the AI companies really living in an echo chamber? They should have implemented a tool to count letters (and to search for palindromes, and…) and just had the LLM pass the request to the count_letters function, as they already do for many other tasks…

  • Why on earth would they do this? This is not a fundamentally useful task; it serves as a measure of the LLM's ability to generalize to tasks outside of its training data and that strain the limits of what it can express.

    • Because optics matter. They are all ultimately fundraising and competing, and this is terrible PR.

      Ask Jeeves from 1997 could answer this question, so tell me why we need to devote a nation-state amount of compute power to feed an “AI” that confidently gets kindergarten level questions dead ass wrong?

      I have the same kind of question when I watch the AI summary on Google output tokens one by one to give me less useful information than what is right there in the first search result from Wikipedia (fully sourced, too).

    • If you’re advertising that your new LLM is like a PhD in your pocket, and it fails on a task that a first grader can do, it makes it hard to take your other claims seriously.

  • > Are the AI companies really living in an echo chamber?

    The author tested 12 models, and only one was consistently wrong. More than half were correct 100% of the time.

    A better conclusion would be that there’s something in particular wrong with GPT-5 Chat; all the other GPT-5 variants are OK. I wonder what’s different?

    • This “one model” was supposed to be the state-of-the-art model, from the company that had a model named “strawberry” in honor of the same problem. And other models struggle with this extremely basic question too.

  • "If you are asked to count things, write a python program to do it". Problem solved.

OpenAI definitely tarnished the name of GPT-5 by allowing these issues to occur. It's clearly a smaller model optimized for cost and speed. Compare it to GPT-4.5 which didn't have these errors but was "too expensive for them".

This is why Anthropic's naming system of Haiku, Sonnet, and Opus to represent size is really nice. It prevents this confusion.

  • > This is why Anthropic's naming system of Haiku, Sonnet, and Opus to represent size is really nice. It prevents this confusion.

    In contrast to GPT-5, GPT-5 mini and GPT-5 nano?

    • I think it's a valid complaint that the naming scheme for the various GPT-4 models was very confusing. GPT-5 just launched, and doesn't (yet?) have a GPT-5 Turbo or GPT-o5 mini to muddy the waters.


  • Yeah, one of the main reasons I switched my tooling over to default to Anthropic models, despite starting out with OpenAI for months prior, was that I often switch between different model sizes depending on the complexity of the prompt versus how quickly I want the result.

    I would frequently spend time going back to OpenAI's site to remind myself of their different models. There’s no consistency there whatsoever. But with Anthropic it was easy.

    If I have to spend 5 minutes picking a model then I might as well do the task myself. So Claude became a natural solution for me.

  • > OpenAI definitely tarnished the name of GPT-5 by allowing these issues to occur

    For a certain class of customer maybe that is true.

    But the reality is that this occurring at all is very encouraging: they are not micro-optimizing to solve cosmetic problems that serve no functional purpose. They are instead letting these phenomena serve as external benchmarks of a sort, to evaluate how well the LLM can work on tasks that are outside its training data and outside what one would expect its capabilities to be.

  • Oh wow, I stare at those model names every day, and I only just now after reading your comment realized what “haiku”, “sonnet”, and “opus” imply about the models! Seems super obvious in retrospect but never thought about it!

  • I mean yeah, but to many non-native speakers, sonnet and opus don't immediately convey size or complexity of the models.

    • I agree it’s not perfect. But it’s just 3 terms those non-English speakers need to learn. Which is a lot easier than having to remember every OpenAI model name and how it compares to every other one.


    • I think non-native speakers have the ability to remember that one word equals big and another equals medium.

      If anything it's a lot less confusing than the awful naming convention from OpenAI up until 5.


The hilarious thing is that LLMs will happily explain the token problem to you, and will even list out exactly which words will have these problems and where. Then, in the same breath, when you ask it about letter counts, it will screw up.

It seems like this could be easily solved in models that support tool calling by providing them with a tool that takes a token and returns the individual graphemes.

It doesn't seem valuable for the model to memorize the graphemes in each of its tokens.
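A rough sketch of such a tool, written as a plain Python function the model could call (the spell_out name and the idea of exposing it via tool calling are assumptions; true grapheme clusters would need the third-party regex module, so this stays at the character level):

```python
import unicodedata

def spell_out(text: str) -> list[str]:
    """Return the individual characters of a string, so the model does not
    have to reconstruct spelling from its own tokenization."""
    # NFC normalization keeps composed characters (e.g. accented letters) together.
    return list(unicodedata.normalize("NFC", text))

print(spell_out("blueberry"))
# ['b', 'l', 'u', 'e', 'b', 'e', 'r', 'r', 'y']
```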

  • Yes, but are you going to special-case all of these pain points? The whole point of these LLMs is that they learn from training data, not from people coding logic directly. If you do this, people will come up with a dozen new ways in which the models fail. They are really not hard to find. Basically, asking them to do anything novel risks complete failure. The interesting bit is that LLMs tend to work best at "medium difficulty" problems: homework questions, implementing documented APIs, and things like that. Asking them to do anything completely novel tends to fail, as does asking them to do something so trivial that normal humans won't bother even writing it down.

    • It makes sense when users ask for information that isn't available in the tokenized values, though. In the abstract, a tool that re-tokenizes certain context contents when a prompt references them is probably necessary to solve this issue (if you consider it worth solving).


    • Tokenization is an inherent weakness of current LLM design, so it makes sense to compensate for it. Hopefully some day tokenization will no longer be necessary.

  • That takes away from the notion that LLMs have emergent intelligent abilities. Right now it doesn't seem valuable for a model to count letters, even though it is a very basic measure of understanding. Will this continue in other domains? Will we be doing tool-calling for every task that's not just summarizing text?

    • > Will we be doing tool-calling for every task that's not just summarizing text?

      spoiler: Yes. This has already become standard for production use cases where the LLM is an external-facing interface; you use an LLM to translate the user's human-language request to a machine-ready, well-defined schema (i.e. a protobuf RPC), do the bulk of the actual work with actual, deterministic code, then (optionally) use an LLM to generate a text result to display to the user. The LLM only acts as a user interface layer.
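A stripped-down sketch of that pattern, assuming Python; the request schema and the parse/render helpers are placeholders (a real system would use something like a protobuf or JSON schema plus function calling):

```python
from dataclasses import dataclass

@dataclass
class CountLettersRequest:
    word: str
    letter: str

def parse_request(user_text: str) -> CountLettersRequest:
    # In production an LLM constrained to this schema would fill it in;
    # hard-coded here for illustration.
    return CountLettersRequest(word="blueberry", letter="b")

def handle(req: CountLettersRequest) -> int:
    # The bulk of the work is plain deterministic code.
    return req.word.lower().count(req.letter.lower())

def render_reply(req: CountLettersRequest, result: int) -> str:
    # Optionally an LLM phrases the answer; a template does the same job.
    return f'There are {result} "{req.letter}"s in "{req.word}".'

req = parse_request("how many b's are there in blueberry?")
print(render_reply(req, handle(req)))
```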

    • How is counting letters a measure of understanding, rather than a rote process?

      The reason LLMs struggle with this is because they literally aren't thinking in English. Their input is tokenized before it comes to them. It's like asking a Chinese speaker "How many Rs are there in the word 草莓".


  • We're up to a gazillion parameters already; maybe the next step is to just ditch the tokenization step and let the LLMs encode the tokenization process internally?

A while back I encoded the message "Can you read this?" in rot13 (or something very similar) and sent it to ChatGPT. It answered back (yes) with a sentence in rot13. Why could it do that but not count letters?
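For reference, rot13 is trivial to produce in Python with the standard library; this is just how one might generate such a test message, and says nothing about how the model decodes it:

```python
import codecs

msg = codecs.encode("Can you read this?", "rot_13")
print(msg)                           # Pna lbh ernq guvf?
print(codecs.decode(msg, "rot_13"))  # Can you read this?
```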

  • At the VERY beginning I asked it to reverse a hash, and was stunned that it was able to. This was in that first moment when the transformer burst on the scene.

    Turned out it just knew that particular string because it was on the internet. I don’t recall exactly but it was probably “hello world” or something like that.

    Anyway to your specific question, many of the models call Python tools behind the scenes now.

It might help a bit to expand this test to a short phrase. With such a small test the model can be right for the wrong reasons; opening up a bit of space to be wrong in might sharpen the differences.

(My one-off test of the default ChatGPT model, whatever that is, got 'How many b's are there in "Billy Bob beat the record for bounding?"' correct first try, with correct reasoning given.)

I tried with three different models:

- hf.co/lmstudio-community/DeepSeek-R1-Distill-Llama-8B-GGUF:Q8_0
- qwen3:4b-instruct-2507-fp16
- hf.co/ibm-granite/granite-3.3-2b-instruct-GGUF:F16

ollama run qwen3:4b-instruct-2507-fp16

>>> how many b’s are there in blueberry?

Let's break down the word *"blueberry"* step by step to count how many *b's* are in it.

Step 1: Write out the word clearly: *b l u e b e r r y*

Step 2: Go through each letter one by one:

- b → yes, a *b*
- l → no
- u → no
- e → no
- b → yes, another *b*
- e → no
- r → no
- r → no
- y → no

Step 3: Count the *b's*:

- First *b* at the beginning
- Second *b* after "blue"

So, there are *two b's*.

Final answer: *2* b's in "blueberry".

ollama run hf.co/ibm-granite/granite-3.3-2b-instruct-GGUF:F16

>>> how many b’s are there in blueberry?

The word "blueberry" contains two 'b's.

(Fastest, lol; granite models are pretty underrated.)

The r1-distill output was similar to the qwen instruct one, but it double-checked its thinking part.

I gave it a puzzle recently to test it out: "An ex-cop lost his home, his car and his girlfriend. What did he lose first?"

The AI thought it over and concluded that he had lost his job first, until I pointed out that that was not the first thing he had lost: that was his umbilical cord, a far better answer, in the AI's opinion.

Which raises many questions: Can an AI disagree with you? Will AI develop solid out-of-the-box thinking as well as in-the-box thinking? Will it grasp applying both for through-the-box thinking and solutions...

After all, we have yet to perfect the teaching of children, so the training of AI has a long way to go; it will come down to quality over quantity, and to deciding what is quality and what is not. Garbage in, garbage out is probably more important today than it ever was in the history of technology.

  • Beyond the hype, and even given the recent breakthroughs in LLMs, it would be very valuable to start compiling and summarizing their specific limitations.

Perhaps apocryphal, but: "A single lie discovered is enough to create doubt in every truth expressed." Why trust these tools for the hard things when we can't trust them for the 'easy' ones?

>This is because the transformers are not able to take in or output actual text efficiently. Instead, the text is converted into numerical representations of itself, which is then contextualized to help the AI come up with a logical response. In other words, the AI might know that the tokens “straw” and “berry” make up “strawberry,” but it may not understand that “strawberry” is composed of the letters “s,” “t,” “r,” “a,” “w,” “b,” “e,” “r,” “r,” and “y,” in that specific order. Thus, it cannot tell you how many letters — let alone how many “r”s — appear in the word “strawberry.”
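To make the quoted point concrete, here is one way to peek at how a tokenizer splits these words (this assumes the tiktoken package; cl100k_base is just an example encoding, and the exact splits differ between models):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for word in ("strawberry", "blueberry"):
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    # The model is fed the token IDs, not the letters inside each piece.
    print(word, ids, pieces)
```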

The quoted passage is a great example. The LLM doesn't know something, but it makes something up in its place. Just because it made something up doesn't mean it's incapable of reasoning.

The thing with LLMs is that they can reason; there's evidence for that. But they can also be creative. And the line between reasoning and creativity at a low level is a bit of a blur, as reasoning is a form of inference, but so is creativity. So when an LLM reasons, gets creative, or hallucinates, it's ultimately doing the same type of thing: inference.

For us, we have mechanisms in our brain that allow us to tell the difference most of the time. The LLM does not. That's the fundamental line. And I feel that because of this we are really close to AGI. A lot of people argue the opposite: they think reasoning is core to intelligence and a separate concept from creativity, and that all LLMs lack reasoning. I disagree.

In fact, we humans have trouble separating hallucination from reasoning ourselves. Look at religion. Religion permeates our culture, but it's basically all hallucination that we ultimately mistake for reasoning. Right? Ask any Christian or Muslim: the religions make rational sense to them! They can't tell the difference.

So the key is to give the LLM the ability to know the difference.

Is there some way to build into the transformer a way to quantify whether something is fact or fiction? Say the answer to a prompt produces an inferred data point that's very far away from any cluster of data. Could we derive from that a metric that quantifies how likely the response is, based on evidence?

Right? The whole thing is on a big mathematical multidimensional curve. If the inferred point on the curve is right next to existing data, then it must be more likely to be true. If it's far away in some nether region of the curve, then it's more likely to be false.

If the LLM can be more self-aware, and we can build this quantitative metric into the network and then use reinforcement learning to make the network less sure about an answer when it's far away from a cluster of training data points, we can likely improve the hallucination problem quite a lot.

Of course, I'm sure this is a blunt instrument, as even falsely inferred data can be very close to existing training data. But at least this gives the LLM some level of self-awareness about how reliable its own answer is.
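A toy sketch of that kind of signal, using distance to the nearest cluster of "known" embeddings as a stand-in for "how far from the training data is this answer?" (the embeddings, centroids, and scale here are all made up for illustration):

```python
import numpy as np

def confidence_from_distance(answer_vec: np.ndarray,
                             centroids: np.ndarray,
                             scale: float = 1.0) -> float:
    """Map distance to the nearest training-data centroid into a (0, 1] score:
    1.0 means right on top of known data, values near 0 mean far out."""
    dists = np.linalg.norm(centroids - answer_vec, axis=1)
    return float(np.exp(-dists.min() / scale))

# Dummy stand-ins for training-cluster centroids and two candidate answers.
rng = np.random.default_rng(0)
centroids = rng.normal(size=(5, 8))                     # 5 centroids in 8-d space
near_answer = centroids[0] + 0.05 * rng.normal(size=8)  # sits near a cluster
far_answer = centroids[0] + 5.0 * rng.normal(size=8)    # wanders far away

print(confidence_from_distance(near_answer, centroids))  # close to 1
print(confidence_from_distance(far_answer, centroids))   # much smaller
```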

Project manager: “great news! Our model can count Rs in strawberry!”

Dev: “What about Bs in blueberry?”

PM: “you’ll need to open a new jira ticket”

  • This is likely literally what happens at these companies, i.e., they have teams that monitor Twitter/social media for fails and fix them with data patches.

    • Which is why I don't trust any of the benchmarks LLM enthusiasts point to when they say "see the model is getting better". I have zero confidence that the AI companies are trying to make the system better, rather than using the measure as a target.


Can modern humans count the number of tokens in "blueberry"? (You're not allowed to cheat and use https://platform.openai.com/tokenizer)

  • I see this argument every time this particular failing is brought up, and regardless of the reasoning, it's still completely failing to do something a second grader should be able to do. Until it stops failing in that way, being like "um, technically it's a lot harder than you think" isn't going to win over AI skeptics.

  • No, but I can recognize my inability to do so instead of making the answer up.