Comment by poemxo
2 months ago
That takes away from the notion that LLMs have emergent intelligent abilities. Right now it doesn't seem valuable for a model to count letters, even though it is a very basic measure of understanding. Will this continue in other domains? Will we be doing tool-calling for every task that's not just summarizing text?
> Will we be doing tool-calling for every task that's not just summarizing text?
spoiler: Yes. This has already become standard for production use cases where the LLM is an external-facing interface: you use an LLM to translate the user's human-language request into a machine-ready, well-defined schema (e.g. a protobuf RPC), do the bulk of the actual work with ordinary, deterministic code, then (optionally) use an LLM to generate a text result to display to the user. The LLM acts only as a user interface layer.
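A rough sketch of that shape in Python (call_llm, RefundRequest, and the refund logic here are hypothetical placeholders, not any particular framework's API):

    import json
    from dataclasses import dataclass

    # Hypothetical stand-in for whatever LLM client you actually use.
    def call_llm(prompt: str) -> str:
        raise NotImplementedError("wire up your LLM provider here")

    @dataclass
    class RefundRequest:
        order_id: str
        amount_cents: int

    def parse_request(user_message: str) -> RefundRequest:
        # LLM as the interface layer: free-form text in, well-defined schema out.
        raw = call_llm("Return JSON with order_id and amount_cents for: " + user_message)
        data = json.loads(raw)
        return RefundRequest(order_id=data["order_id"], amount_cents=int(data["amount_cents"]))

    def process_refund(req: RefundRequest) -> bool:
        # The bulk of the work: plain deterministic code, no model involved.
        return req.amount_cents <= 10_000  # e.g. auto-approve small refunds

    def handle(user_message: str) -> str:
        req = parse_request(user_message)
        approved = process_refund(req)
        # Optionally hand this back to an LLM to phrase; a template works just as well.
        return f"Refund for order {req.order_id}: {'approved' if approved else 'needs review'}"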
How is counting letters a measure of understanding, rather than a rote process?
The reason LLMs struggle with this is that they literally aren't thinking in English: their input is tokenized before it ever reaches them. It's like asking a Chinese speaker "How many Rs are there in the word 草莓 (strawberry)?"
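For what it's worth, you can see what the model actually receives with a quick check (assuming the tiktoken package is installed; exact splits vary by tokenizer and model):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    token_ids = enc.encode("strawberry")
    # The word arrives as a handful of multi-character chunks, not as individual letters.
    print([enc.decode([t]) for t in token_ids])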
It shows an understanding that words are made up of letters and that those letters can be counted.
Since tokens are atomic, which I didn't realize earlier, maybe it's still intelligent if it can realize it can extract the result by writing len([b for b in word if b == my_letter]) and decide on its own to return that value.
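Spelled out as a small runnable helper (just a sketch of that one-liner, with made-up names):

    def count_letter(word: str, my_letter: str) -> int:
        # Count occurrences of my_letter in word, character by character.
        return len([b for b in word if b == my_letter])

    print(count_letter("strawberry", "r"))  # 3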
But why doesn’t the LLM reply “I can’t solve this task because I see text as tokens”, rather than give a wrong answer?