Comment by pxc

2 months ago

Here's an example of what gpt-oss-20b (at the default mxfp4 precision) does with this question:

> How many "s"es are in the word "Mississippi"?

The "thinking portion" is:

> Count letters: M i s s i s s i p p i -> s appears 4 times? Actually Mississippi has s's: positions 3,4,6,7 = 4.

The answer is:

> The word “Mississippi” contains four letter “s” s.

They can indeed do some simple pattern matching on the query, separate the letters out into separate tokens, and count them, without having to do something like run code in a sandbox and ask it for the answer.
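For comparison, the strategy the thinking trace spells out is roughly the following in plain Python (a minimal sketch of the same enumerate-and-tally idea; the model does this in-context, not by executing code):

```python
word = "Mississippi"
target = "s"

# Spell the word out letter by letter, as the thinking trace does,
# and tally each position where the target letter appears.
count = 0
for position, letter in enumerate(word, start=1):
    if letter.lower() == target:
        count += 1
        print(f"position {position}: {letter}")

print(f'"{word}" contains {count} "{target}"s')  # -> 4, at positions 3, 4, 6, 7
```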

The issue here is just that this workaround/strategy is only trained into the "thinking" models, afaict.

That proves nothing. The fact that Mississippi has 4 "s"s is far more likely to be in the training data than the fact that blueberry has 2 "b"s.

And now that fact is going to be in the data for the next round of training. We'll need to try some other words on the next model.

  • It does the same thing with a bunch of different words like "committee", "disestablishmentarianism", "dog", "Anaxagoras", and a string I typed by mashing the keyboard, "jwfekduadasjeudapu". The strategy seems fairly general and performs pretty reliably.

    (Sometimes the trace is noisier, especially in quants other than the original.)

    This task is pretty simple, and I think it can be solved easily with the same kind of statistical pattern matching these models use to write other text.

I'll be impressed when you can reliably give them a random four-word phrase for this test. Because I don't think anyone is going to try to teach them all those facts; even if they're trained to know letter counts for every English word (as the other comment cites as a possibility), they'd then have to actually count and add, rather than presenting a known answer plus a rationalization that looks like counting and adding (and is easy to come up with once an answer has already been decided).

(Yes, I'm sure an agentic + "reasoning" model can already deduce the strategy of writing and executing a .count() call in Python or whatever. That's missing the point.)
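For concreteness, this is the sort of trivial script such a model could write and run, with a made-up four-word phrase of the kind proposed above (the phrase and variable names are just illustrative):

```python
# Hypothetical test phrase; any random four words would do.
phrase = "purple blueberry basket bobbles"
target = "b"

# Count per word with str.count(), then add -- the "count and add"
# a model would have to actually do rather than recall a known answer.
per_word = {word: word.count(target) for word in phrase.split()}
total = sum(per_word.values())

print(per_word)  # {'purple': 0, 'blueberry': 2, 'basket': 1, 'bobbles': 3}
print(total)     # 6
```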

  • 5 "b"s, not counting the parenthetical at the end.

    https://claude.ai/share/943961ae-58a8-40f6-8519-af883855650e

    Amusingly, it struggled a bit to understand what I wanted from the Python script to confirm the answer.

    I really don't get why people think this is some huge unfixable blind spot...

    • I don't think the salience of this problem is that it's a supposedly unfixable blind spot. It's an illustrative failure in that it breaks the illusory intuition that something that can speak and write to us (sometimes very impressively!) also thinks like us.

      Nobody who could give answers as good as ChatGPT often does would struggle so much with this task. The fact that an LLM works differently from a whole-ass human brain isn't actually surprising when we consider it intellectually, but that habit of always intuiting a mind behind language whenever we see language is subconscious and reflexive. Examples of LLM failures which challenge that intuition naturally stand out.

    • That indeed looks pretty good. But then why are we still seeing the issue described in OP?

  • You can already do it with arbitrary strings that aren't in the dictionary. But I wonder if the pattern matching will break once strings are much longer than any word in the dictionary, even if there's plenty of room left in context and all that.
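    A quick way to generate such strings, with a ground-truth count to check the model against (a minimal sketch; the length and alphabet are arbitrary choices):

    ```python
    import random
    import string

    # Build a random string far longer than any dictionary word,
    # and record the true count of a target letter for checking
    # the model's answer.
    length = 200
    s = "".join(random.choices(string.ascii_lowercase, k=length))
    target = "e"

    print(s)
    print(f'true count of "{target}": {s.count(target)}')
    ```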