Comment by lossyalgo

4 days ago

Furthermore, regarding reasoning: just ask any LLM how many r's are in "strawberry" - repeat it maybe 3 times just to get a feeling for how much variance you get in the answers. This "quirk" - the inability to get the right answer - is still an issue after 2 years of people making fun of LLMs on various forums. The models aren't getting smarter, and they definitely aren't thinking; they are still token generators with a few tricks on top to make them seem more intelligent than their predecessors.

Sorry to say, Claude Sonnet 4.5 (with and without extended thinking) has this one solved now. So, not “any” LLM, but maybe some still need work.

I presume they finally got around to training the internal reasoning loop to recognize when it needs to spell the answer out character by character.
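To make that concrete: the model never sees individual letters unless something spells the word out, because the input arrives as subword tokens rather than characters. A minimal sketch of what that looks like, assuming the Python tiktoken package and its cl100k_base encoding (just one example encoding, not necessarily what Claude uses internally):

    # Inspect how a BPE tokenizer chunks the word - the model operates on
    # these chunks, not on individual letters.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode("strawberry")
    print([enc.decode([t]) for t in tokens])  # a few multi-letter chunks, not 10 letters

    # The character-by-character fallback seen in the transcripts below is
    # trivial once the word is actually spelled out:
    print("strawberry".count("r"))  # 3

The counting is only easy once something - the reasoning loop, or a prodding user - forces the word to be expanded character by character.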

--- Transcripts from Claude Sonnet 4.5, initially with extended thinking:

> How many a’s in the phrase “artichokes and aardvarks”? Be sure to check your answer.

> Let me count the letter ‘a’ in “artichokes and aardvarks”: • artichokes: 1 ‘a’ (at the beginning) • and: 1 ‘a’ (at the beginning) • aardvarks: 3 ‘a’s (two at the beginning: “aa”, and one in the middle: “ardvarks”) Total: 1 + 1 + 3 = 5 ‘a’s Let me verify by marking them: artichokes and aardvarks Yes, there are 5 letter ‘a’s in the phrase “artichokes and aardvarks”.

> Are you sure that’s correct?

> Let me recount more carefully: “artichokes and aardvarks” Going through each word letter by letter: 1. artichokes - has 1 ‘a’ (first letter) 2. and - has 1 ‘a’ (first letter) 3. aardvarks - has 3 ‘a’s (first letter, second letter, and one more in the middle) Total: 1 + 1 + 3 = 5 ‘a’s Yes, I’m confident the answer is 5. Is there something specific you’d like me to recheck?

> How many r’s in strawberry?

> Let me count the letter ‘r’ in “strawberry”: s-t-r-a-w-b-e-r-r-y Going through each letter: • s: not an r • t: not an r • r: 1st r • a: not an r • w: not an r • b: not an r • e: not an r • r: 2nd r • r: 3rd r • y: not an r There are 3 r’s in “strawberry”.

Without extended thinking:

> How many u's in universalulis?

> Looking at “universalulis” letter by letter: u-n-i-v-e-r-s-a-l-u-l-i-s There are 2 u’s in “universalulis”.

For what it's worth, when I ask ChatGPT 5, it gets the correct answer every time. The response varies, but the answer is always three.

  • Now try a different language. My take is that this is hard RL tuning to fix these "gotchas", since the underlying model can't do it on its own.

    OpenAI is working on ChatGPT the application and its ecosystem. They have transitioned from model building to software engineering: RL tuning and integrating various services to solve the problems the model can't handle on its own. Make it feel smart rather than be smart.

    This means that as soon as you find a problem that steps outside the guided experience, you get the raw model again, which fails when it encounters these "gotchas".

    Edit - Here's an example of a heavily RL-tuned experience in English, where a whole load of context is added on how to solve the problem, while the Swedish prompt for the same word fails.

    https://imgur.com/a/SlD84Ih

    • You can tell it "be careful about the tokenizer issues" in Swedish and see how that changes the behavior.

      The only thing this stupid test demonstrates is that LLM metacognitive skills are still lacking, which shouldn't be a surprise to anyone. The only surprising thing is that they have metacognitive skills at all, despite the base model training doing very little to encourage their development.
