Comment by poemxo

2 months ago

That takes away from the notion that LLMs have emergent intelligent abilities. Right now it doesn't seem valuable for a model to count letters, even though it is a very basic measure of understanding. Will this continue in other domains? Will we be doing tool-calling for every task that's not just summarizing text?

> Will we be doing tool-calling for every task that's not just summarizing text?

spoiler: Yes. This has already become standard for production use cases where the LLM is an external-facing interface; you use an LLM to translate the user's human-language request into a machine-ready, well-defined schema (e.g. a protobuf RPC), do the bulk of the actual work with ordinary, deterministic code, then (optionally) use an LLM to generate a text result to display to the user. The LLM acts only as a user interface layer.
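
A minimal sketch of that pattern, assuming a hypothetical llm() helper standing in for whatever completion API you call, and plain JSON instead of a protobuf RPC:

    import json

    def llm(prompt: str) -> str:
        # Hypothetical stand-in: call your model provider here.
        raise NotImplementedError

    def handle_request(user_text: str) -> str:
        # 1. The LLM translates the free-form request into a well-defined schema.
        structured = json.loads(llm(
            'Return JSON {"word": str, "letter": str} extracted from: ' + user_text
        ))

        # 2. Deterministic code does the actual work.
        count = structured["word"].count(structured["letter"])

        # 3. Optionally, the LLM renders the result back into prose for display.
        return llm(f'Tell the user that "{structured["letter"]}" appears '
                   f'{count} time(s) in "{structured["word"]}".')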

How is counting letters a measure of understanding, rather than a rote process?

The reason LLMs struggle with this is that they literally aren't thinking in English. Their input is tokenized before it reaches them. It's like asking a Chinese speaker "How many Rs are there in the word 草莓 (strawberry)?"
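
To make that concrete, here is a small sketch using the tiktoken library with its cl100k_base encoding (an assumption; the exact split depends on which tokenizer a given model uses):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode("strawberry")

    # The model sees integer token IDs, not individual letters.
    print(tokens)
    print([enc.decode_single_token_bytes(t) for t in tokens])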

  • It shows understanding that words are made up of letters and that they can be counted

    Since tokens are atomic (which I didn't realize earlier), maybe it's still a sign of intelligence if the model realizes it can extract the result by writing len([b for b in word if b == my_letter]) and decides on its own to return that value (a runnable version is sketched below).

  • But why doesn’t the LLM reply “I can’t solve this task because I see text as tokens”, rather than give a wrong answer?
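
Picking up the nested point above: a runnable version of that snippet, assuming the host actually executes whatever code the model writes (e.g. via a code-interpreter tool):

    word, my_letter = "strawberry", "r"

    # Counting characters in the raw string sidesteps tokenization entirely.
    count = len([b for b in word if b == my_letter])
    assert count == word.count(my_letter) == 3

    print(count)  # 3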