Comment by minimaxir

4 months ago

I responded to this counterpoint in a blog post I wrote about a similar question posed to LLMs (how many b's are in blueberry): https://news.ycombinator.com/item?id=44878290

> Yes, asking an LLM how many b’s are in blueberry is an adversarial question in the sense that the questioner is expecting the LLM to fail. But it’s not an unfair question, and it’s objectively silly to claim that LLMs such as GPT-5 can operate at a PhD level, but can’t correctly count the number of letters in a word.

It's a subject that the Hacker News bubble and the real world treat differently.

> it’s objectively silly to claim that LLMs such as GPT-5 can operate at a PhD level, but can’t correctly count the number of letters in a word.

I know enough PhDs with heavy dyslexia to say: no, there's no connection here. You can be a PhD-level physicist without being able to spell anything.

It’s like defending a test that shows hammers are terrible at driving screws by saying that many people are unclear on how to use tools.

It remains unsurprising that a technology that lumps characters together into tokens is not great at processing anything below its resolution.
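As a minimal sketch of what that resolution looks like, assuming the tiktoken library and its cl100k_base vocabulary (both choices are illustrative; other tokenizers will split words differently):

```python
# Illustrative only: the tokenizer choice is an assumption, and the
# exact token boundaries depend on the vocabulary.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
word = "blueberry"

ids = enc.encode(word)                    # integer token IDs
pieces = [enc.decode([i]) for i in ids]   # the text chunk each ID covers

print(len(word), "characters")       # 9
print(len(ids), "tokens:", pieces)   # a few multi-character chunks
```

The model's smallest unit of input is the token, so a nine-letter word may arrive as only a few opaque chunks rather than nine individual symbols.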

Now, if there are use cases other than synthetic tests where this capability matters, maybe there's something interesting here. But merely pointing out that you can't actually climb the trees pictured on a map is not that interesting.

  • And yet... now many of them can do it. I think it's premature to say "this technology is for X" when what it was originally invented for was translation, and every capability it has developed since then has been an immense surprise.

    • No, this is not what happened.

      What any reasonable person expects from "count occurrences of [letter] in [word]" is for a meta-language skill to kick in and actually look at the symbols, not the semantic word. It should count the e's in "thee" and the w's in "willow".

      LLMs that use multi-symbol tokenization won't ever be able to do this. The information is lost in the conversion to embeddings. It's like giving you a 2x2-pixel GIF and asking you to count the flowers: 2x2 is enough to determine dominant colors, but not fine detail. (There's a short sketch of this contrast at the end of this comment.)

      Instead, LLMs have been trained on the semantic facts that "strawberry has three r's" and other common test answers, just like they're trained that the US has 50 states or that motorcycles have two wheels. It's a fact stored in intrinsic knowledge, not a reasoning capability over the symbols the user typed (which the actual LLM never sees).

      It's not a question of intent or adaptation; it's an information-theory limit, just like the Nyquist frequency.
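
      A hedged sketch of that contrast, assuming Python and the tiktoken library (an illustrative stand-in for whatever tokenizer a given model actually uses):

      ```python
      # Illustrative only: tiktoken and the cl100k_base vocabulary are
      # assumptions; exact token boundaries vary by vocabulary.
      import tiktoken

      enc = tiktoken.get_encoding("cl100k_base")
      word = "strawberry"

      # The meta-language skill: inspect the symbols directly.
      print(word.count("r"))  # 3

      # What the model receives instead: opaque integer token IDs, later
      # mapped to embeddings. No per-character information is explicit
      # here; recovering "three r's" from these IDs requires memorized
      # knowledge of each token's spelling.
      print(enc.encode(word))
      ```

      The count is trivial with access to the characters, and unreadable from the IDs alone: the whole argument in two print statements.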

    • > And yet... now many of them can do it.

      Presumably because they trained them to death on this useless test that people somehow just wouldn't shut up about.
