Comment by brookst

4 months ago

No, this is not what happened.

What any reasonable person expects from "count occurrences of [letter] in [word]" is for a meta-language skill to kick in and actually look at the symbols, not the semantic word: it should count the e's in "thee" and the w's in "willow".
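For illustration, the symbol-level operation being asked for is trivial for anything that actually sees characters. A minimal Python sketch (the function name is mine):

    def count_letter(word: str, letter: str) -> int:
        # Walk the raw symbols, character by character.
        return sum(1 for ch in word if ch == letter)

    print(count_letter("thee", "e"))    # 2
    print(count_letter("willow", "w"))  # 2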

LLMs that use multi-symbol tokenization won't ever be able to do this: the character-level information is lost in the conversion to tokens and embeddings. It's like handing you a 2x2-pixel GIF and asking you to count the flowers: 2x2 is enough to determine the dominant colors, but not fine detail.
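You can see the coarsening directly by inspecting a tokenizer. A sketch using the tiktoken library (assuming it's installed; the exact splits depend on the encoding, so run it rather than trust any particular boundary):

    import tiktoken

    # cl100k_base is the encoding used by several OpenAI chat models.
    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode("strawberry")

    # Each id stands for a multi-character chunk, not a letter;
    # the model receives these ids, never the underlying characters.
    for tid in ids:
        print(tid, enc.decode_single_token_bytes(tid))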

Instead, LLMs have been trained on the semantic fact that "strawberry has three r's" and other common test cases, just like they're trained that the US has 50 states or that motorcycles have two wheels. It's a fact stored in intrinsic knowledge, not a reasoning capability over the symbols the user typed (which the actual LLM never sees).

It's not a question of intent or adaptation; it's an information-theory constraint, just like the Nyquist limit: once you sample below the required rate, the fine detail is unrecoverable, no matter how clever the downstream processing.