Comment by Jensson
4 days ago
They are trained on many billions of tokens of text dealing with character-level input; they would be rather dumb if they couldn't learn it anyway.
Every human learns this: when you hear the word "strawberry" spoken, you don't hear the double r, yet you still know the answer.
These models operate on tokens, not characters. It’s true that training budgets could be spent on exhaustively enumerating how many of each letter are in every word in every language, but it’s just not useful enough to be worth it.
It’s more like asking a human for the Fourier components of how they pronounce “strawberry”. I mean the audio waves are right there, why don’t you know?
Although the vast majority of tokens are 4+ characters, are you seriously saying that the individual characters of the English alphabet didn't make the cut? What about 0-9?
Each character made the cut, but the word "strawberry" is a single token, and that single token is what the model gets as input. When humans read text, they can see each individual character in the word "strawberry" every time they see that word. LLMs don't see individual characters when they process input text containing the word "strawberry". They can only learn the spelling if some text explicitly maps "strawberry" to the sequence of characters s t r a w b e r r y. My guess is there are not enough such mappings in the training dataset for the model to learn it well.
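To make the point concrete, here's a minimal sketch of greedy longest-match tokenization (a simplified stand-in for real BPE; the vocabulary is made up for the example). The model downstream only ever sees the integer ids, so the characters inside "strawberry" are invisible to it:

```python
# Toy illustration: the tokenizer maps whole words to integer ids,
# and the model receives only the ids, never the characters.
# This vocabulary is invented for the example, not a real one.
toy_vocab = {"straw": 1, "berry": 2, "strawberry": 3, " ": 4}

def encode(text, vocab):
    """Greedy longest-match tokenizer (a simplified stand-in for BPE)."""
    tokens = []
    i = 0
    while i < len(text):
        # Take the longest substring starting at i that is in the vocab.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no token for {text[i]!r}")
    return tokens

print(encode("strawberry", toy_vocab))   # [3] -- one id, zero visible letters
print(encode("straw berry", toy_vocab))  # [1, 4, 2] -- the space changes everything
```

Note how "strawberry" collapses into a single id while "straw berry" becomes three: the segmentation depends entirely on the vocabulary, which is roughly why letter-counting questions are hard for token-level models.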