Comment by yahoozoo
4 days ago
Although a vast majority of tokens are 4+ characters, you’re seriously saying that each individual character of the English alphabet didn’t make the cut? What about 0-9?
Each character made the cut, but the word "strawberry" is a single token, and that single token is what the model gets as input. When humans read text, they see each individual character of the word "strawberry" every time it appears. LLMs don't see individual characters when they process input text containing the word "strawberry". They can only learn the spelling if some text explicitly maps "strawberry" to the sequence of characters s t r a w b e r r y. My guess is there are not enough such mappings in the training dataset for the model to learn it well.
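If you want to see what the model actually receives, here's a rough sketch using OpenAI's tiktoken library (the exact split depends on which tokenizer you pick):

    # Rough illustration with tiktoken (pip install tiktoken).
    # The point: the model receives integer token IDs, not characters.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")   # tokenizer used by GPT-3.5/4

    ids = enc.encode("strawberry")
    print(ids)                                   # a short list of integers
    print([enc.decode([i]) for i in ids])        # the pieces those integers stand for

    # Nothing in `ids` exposes the letters s-t-r-a-w-b-e-r-r-y to the model;
    # any spelling knowledge has to come from the training text itself.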
The fact that the word ends up being 1 token doesn't mean the model can't track individual characters in it. The model transforms the token into a vector (of several thousand dimensions), and I'm pretty sure there are dimensions corresponding to things like "the 1st character is 'a'", "the 1st character is 'b'", "the 2nd character is 'a'", etc.
So tokens aren't as important.
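One way to test this hypothesis (just a sketch, not something I've run) is a linear probe: take the token embedding matrix from an open model like GPT-2 and check whether, say, the first letter of each token is linearly decodable from its embedding vector. The model choice and probe setup below are purely illustrative.

    # Sketch of a linear probe: can a token's first letter be predicted from its
    # embedding alone? Uses GPT-2's embedding matrix via Hugging Face transformers
    # and scikit-learn.
    import numpy as np
    from transformers import GPT2TokenizerFast, GPT2Model
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    emb = GPT2Model.from_pretrained("gpt2").wte.weight.detach().numpy()

    # Keep purely alphabetic tokens (stripping GPT-2's leading-space marker "Ġ").
    X, y = [], []
    for tid in range(emb.shape[0]):
        s = tok.convert_ids_to_tokens(tid).lstrip("Ġ")
        if s.isalpha():
            X.append(emb[tid])
            y.append(s[0].lower())

    X_tr, X_te, y_tr, y_te = train_test_split(np.array(X), y, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print("first-letter probe accuracy:", probe.score(X_te, y_te))

High probe accuracy would support the idea that character information is recoverable from token embeddings; low accuracy would suggest it isn't stored in any linearly readable way.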
No, the vector is in a semantic embedding space. That's the magic.
So "the sky is blue" converts to the tokens [1820, 13180, 374, 6437]
And "le ciel est bleu" converts to the tokens [273, 12088, 301, 1826, 12704, 84]
Then the embedding vectors created from these are very similar, despite the letters having very little in common.
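You can see this effect with any multilingual embedding model. As a rough proxy for what happens inside the LLM, here's a sketch using the sentence-transformers library (the model name is just one example of a multilingual model):

    # The English and French sentences share almost no tokens, but a multilingual
    # embedding model maps them to nearby vectors.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    en, fr = model.encode(["the sky is blue", "le ciel est bleu"])

    cosine = np.dot(en, fr) / (np.linalg.norm(en) * np.linalg.norm(fr))
    print("cosine similarity:", cosine)   # expected to be close to 1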
Is there any evidence to support your hypothesis?
> the word "strawberry" is a single token, and that single token is what the model gets as input.
This is incorrect.
"strawberry" is actually 4 tokens (at least for GPT, but most LLMs are similar).
See https://platform.openai.com/tokenizer
I got 3 tokens: st, raw, and berry. My point still stands: processing "berry" as a single token does not allow the model to learn its spelling directly, the way human readers do. It still has to rely on an explicit mapping of the word "berry" to b e r r y explained in some text in the training dataset. If that explanation is not present in the training data, it cannot learn the spelling - in principle.