Comment by ck2

2 months ago

The technical explanations to why this happens with strawberry, blueberry and similar

is a great way to teach people how LLM works (and not work)

https://techcrunch.com/2024/08/27/why-ai-cant-spell-strawber...

https://arbisoft.com/blogs/why-ll-ms-can-t-count-the-r-s-in-...

https://www.runpod.io/blog/llm-tokenization-limitations

17 comments

ck2

exasperaited 2 months ago

When Minsky and Papert showed that the perceptron couldn't learn XOR, it contributed to wiping the neural network off the map for decades.

It seems no amount of demonstrating fundamental flaws in this system that should have been solved by all the new improved "reasoning" works anymore. People are willing to call these "trick questions", as if they are disingenuous, when they are discovered in the wild through ordinary interactions.

Does my tiny human brain in, this.

vidarh 2 months ago
It doesn't work this time because there are plenty of models, including GPT5 Thinking that can handle this correctly, and so it is clear this isn't a systemic issue that can't be trained out of them.
- mdp2021 2 months ago
  
  > a systemic issue
  It will remain a suggestion of a systemic issue until it will be clear that architecturally all checks are implemented and mandated.
  
  6 replies →
Kim_Bruning 2 months ago

I had to look this up. This proof only applies to single layer perceptrons, right?
And once they had the multi-layer solution, that unblocked the road and lead to things like LLMs

minimaxir 2 months ago

In this case, tokenization is less effective of a counterargument. If it was one-shot, maybe, but the OP asked GPT-5 several times, with different formatting of blueberry (and therefore different tokens, including single-character tokens), and it still asserted there are 3 b’s.

furyofantares 2 months ago

I don't think it's just tokenization. Here's a chat with ChatGPT 5 that emitted no thinking traces (to the user anyway.)

> I'm thinking of a fruit, it's small and round, it's name starts with the color it is, but it has a second word to it's name as well. Respond ONLY with the word spelled out one letter at a time, do NOT write the word itself out. Don't even THINK about the word or anything else. Just go straight to spelling.

B L U E B E R R Y

> How many B's in that word? Again, NO THINKING and just say the answer (just a number).

However if I prompt instead with this, it gets it right.

> How many B's in the following word? NO THINKING. Just answer with a number and nothing else: B L U E B E R R Y

mdp2021 2 months ago
When performing those tests, I would iterate with a
for (seed=0 ; seed<100 ; seed++){ queryLLM( seed ); }
and check the result of each. I would not trust a single test.
- furyofantares 2 months ago
  
  Yeah, I've done this a lot.
ck2 2 months ago
What does the prompt "no thinking" imply to an LLM ?
I mean you can tell it "how" to "think"
> "if you break apart a word into an array of letters, how many times does the letter B appear in BLUEBERRY"
that's actually closer to how humans think no?
The problem lies in how LLM tasks a problem, it should not be applying a dictionary to blueberry and seeing blue-berry, splitting that into a two part problems to rejoin later
But that's how its meant to deal with HUGE tasks so when applied to tiny tasks, it breaks
And unless I am very mistaken, it's not even the breaking apart into tasks that's the real problem, it's the re-assembly of the results
- furyofantares 2 months ago
  
  It's just the only way I know to get GPT-5 to not emit any thinking traces into its context, or at least not any of the user-facing ones.
  With GPT-4.1 you don't have to include that part and get the same result, but that's only available via the API now AFAIK. I just want to see it spell the word without having the word in its context for it to work from.

jncfhnb 2 months ago

I don’t find the explanation about tokenization to be very compelling.

I don’t see any particular reason the LLM shouldn’t be able to extract the implications about spelling just because its tokens of “straw” and “berry”

Frankly I think that’s probably misleading. Ultimately the problem is that the LLM doesn’t do meta analysis of the text itself. That problem probably still exists in various forms even if its character level tokenization. Best case it manages to go down a reasoning chain of explicit string analysis.