Comment by Kwpolska

2 months ago

The Internet has been poking fun at LLMs failing to count letters for many months. Are the AI companies really living in an echo chamber? They should have implemented a thing to count letters (and to search for palindromes, and…), and just have the LLM pass the request to the count_letters function, as they do for many other tasks already…
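A minimal sketch of what that could look like, assuming an OpenAI-style function-calling schema — `count_letters` and its parameters here are illustrative, not an actual vendor API:

```python
# Hypothetical tool the model could delegate letter-counting to.
def count_letters(text: str, letter: str) -> int:
    """Count case-insensitive occurrences of `letter` in `text`."""
    return text.lower().count(letter.lower())

# JSON-schema tool description in the style used by function-calling APIs,
# so the model can route "how many r's in strawberry?" to the function above.
COUNT_LETTERS_TOOL = {
    "type": "function",
    "function": {
        "name": "count_letters",
        "description": "Count how many times a letter occurs in a string.",
        "parameters": {
            "type": "object",
            "properties": {
                "text": {"type": "string"},
                "letter": {"type": "string", "minLength": 1, "maxLength": 1},
            },
            "required": ["text", "letter"],
        },
    },
}

print(count_letters("strawberry", "r"))  # -> 3
```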

Why on earth would they do this? This is not a fundamentally useful task; it serves as a measure of the LLM's ability to generalize to tasks that lie outside its training data and strain the limits of what it can express.

  • Because optics matter. They are all ultimately fundraising and competing, and this is terrible PR.

    Ask Jeeves from 1997 could answer this question, so tell me why we need to devote a nation-state’s worth of compute to feed an “AI” that confidently gets kindergarten-level questions dead-ass wrong?

    I have the same kind of question when I watch Google’s AI summary output tokens one by one, giving me less useful information than what’s right there in the first search result from Wikipedia (fully sourced, too).

  • If you’re advertising that your new LLM is like a PhD in your pocket, and it fails on a task that a first grader can do, it makes it hard to take your other claims seriously.

> Are the AI companies really living in an echo chamber?

The author tested 12 models, and only one was consistently wrong. More than half were correct 100% of the time.

A better conclusion would be that there’s something in particular wrong with GPT-5 Chat; all the other GPT-5 variants are OK. I wonder what’s different?

  • This “one model” was supposed to be the state-of-the-art model, from the company that had a model named “strawberry” in honor of the same problem. And other models struggle with this extremely basic question too.

"If you are asked to count things, write a python program to do it". Problem solved.