Comment by aDyslecticCrow
4 hours ago
> character counting
The models now waste a vast number of neurons memorising the character counts of the entire English language, just so people can ask how many r's are in strawberry and check a tickbox in a benchmark.
The architecture cannot efficiently or consistently represent counting letters in words. We should never have force-trained them to do it.
The same goes for other, more important "skills" that are unsuited to transformer models.
Most models can now do decent arithmetic. But if you knew how that ability is encoded in the neurons, you would never, ever trust any arithmetic they output, even when they seem to "know" it (unless they called a calculator MCP to get it).
There are fundamental limitations, but we're currently brute-forcing our way through problems we could trivially solve with a different tool.
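For illustration, here's a minimal sketch of the "different tool" point: an exact calculator the model could call instead of doing arithmetic in its weights. The tool name and wiring here are hypothetical; only the arithmetic is real.

    import ast
    import operator

    # Whitelisted integer operations; anything else is rejected.
    OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.FloorDiv: operator.floordiv}

    def calculate(expr: str) -> int:
        """Evaluate a small arithmetic expression exactly, without eval()."""
        def walk(node):
            if isinstance(node, ast.Constant) and isinstance(node.value, int):
                return node.value
            if isinstance(node, ast.BinOp) and type(node.op) in OPS:
                return OPS[type(node.op)](walk(node.left), walk(node.right))
            raise ValueError("unsupported expression")
        return walk(ast.parse(expr, mode="eval").body)

    print(calculate("123456789 * 987654321"))  # 121932631112635269, exact every time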
> The models now waste a vast number of neurons memorising the character counts of the entire English language
No they don’t. They only need to know the character count for each token, and with typical vocabularies having around 250k entries, that’s an insignificant number for all but the tiniest LLMs.
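To make that concrete, here's a toy sketch of the per-token bookkeeping. The vocabulary and tokenization below are made up; real BPE tokenizers differ, but the principle is the same: one small integer per token, summed over the tokenized text.

    # Per-token 'r' counts, learned once for the whole vocabulary.
    R_COUNT = {"straw": 1, "berry": 2}

    tokens = ["straw", "berry"]  # one plausible tokenization of "strawberry"
    print(sum(R_COUNT[t] for t in tokens))  # -> 3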
In a very simplified view:
Those "tokens" humans count are translated into a floating-point vector of roughly 2048 dimensions (depending on the model).
bird => {animal, english, noun, vertebrate, avian} has one r, but what if you make it 20% more "French"? Is it still 1 r? That could be the word "bird" in French, or a French-speaking bird, or a bird species common in France.
If the nearest-neighbour distance to the vocabulary of every language means the vector no longer maps to "bird", then the number of r's must change, via a series of trained conditional checks (with some efficiency gains where languages share general spelling patterns).
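A toy sketch of that thought experiment, with made-up 4-dimensional embeddings (real models learn ~2048 dimensions from data, and no single axis is cleanly "Frenchness"):

    import numpy as np

    vocab = {
        "bird":   np.array([0.9, 0.1, 0.0, 0.2]),
        "oiseau": np.array([0.8, 0.2, 0.9, 0.2]),  # "bird" in French
        "berry":  np.array([0.1, 0.9, 0.0, 0.1]),
    }
    french_axis = np.array([0.0, 0.0, 1.0, 0.0])  # pretend dim 2 encodes "Frenchness"

    for nudge in (0.2, 0.8):
        v = vocab["bird"] + nudge * french_axis
        nearest = min(vocab, key=lambda w: np.linalg.norm(vocab[w] - v))
        print(nudge, "->", nearest)  # 0.2 -> bird (1 r), 0.8 -> oiseau (0 r's)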
That is such an unreasonable amount of compute that it is likely far cheaper, easier and more reliable to train the model to memorise the output:
{"MCP":"python", "content":"len([c for c in 'strawberry' if c == 'r'])"}
The attention mechanism allows LLMs to learn these kinds of absurdly inefficient calculations. But we really shouldn't use LLMs where they're outperformed by trivial existing solutions.
Nope. Tokens aren't what you think they are.