Comment by kadushka
4 days ago
Each character made the cut, but the word "strawberry" is a single token, and that single token is what the model gets as input. When humans read text, they see each individual character in the word "strawberry" every time they encounter it. LLMs don't see individual characters when they process input text containing the word "strawberry". They can only learn the spelling if some text explicitly maps "strawberry" to the sequence of characters s t r a w b e r r y. My guess is that there are not enough such mappings in the training dataset for the model to learn the spelling well.
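To illustrate what I mean, here is roughly what the model actually receives, sketched with OpenAI's tiktoken package (the specific encoding and the exact IDs are just assumptions for the example):

    import tiktoken

    enc = tiktoken.get_encoding("o200k_base")
    # The model sees a short list of integer IDs, not letters:
    print(enc.encode("strawberry"))
    # Spelled out with spaces, each letter tends to become its own token:
    print(enc.encode("s t r a w b e r r y"))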
The fact that the word ends up as one token doesn't mean the model can't track the individual characters in it. The model transforms the token into a vector (with several thousand dimensions), and I'm pretty sure there are dimensions corresponding to things like "the 1st character is an 'a'", "the 1st is a 'b'", "the 2nd is an 'a'", etc.
So tokens aren’t as important.
No, the vector is in a semantic embedding space. That's the magic.
So "the sky is blue" converts to the tokens [1820, 13180, 374, 6437]
And "le ciel est bleu" converts to the tokens [273, 12088, 301, 1826, 12704, 84]
Then the embedding vectors created from these are very similar, despite the letters having very little in common.
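You can see the effect with a quick sketch (this uses sentence-transformers and a multilingual model I picked for convenience, not the token embeddings of any particular LLM, so treat it as an illustration only):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    en, fr = model.encode(["the sky is blue", "le ciel est bleu"])
    # High cosine similarity, even though the token sequences share almost nothing:
    print(util.cos_sim(en, fr))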
Which character sits in the 1st/2nd/3rd position is part of the semantic space, in the broad sense of the word. I ran experiments that seem to roughly support my hypothesis; see below.
Is there any evidence to support your hypothesis?
Good question! I did a small experiment: I trained a small logistic regression from embedding vectors to the 1st/2nd/3rd character of the token: https://chatgpt.com/share/6871061a-7948-8007-ab53-5b0b697e90...
I got 0.863 (1st) / 0.559 (2nd) / 0.447 (3rd) accuracy with the Qwen 3 8B model's embeddings. Note the code is hacky and might be wrong in places, and in reality the transformer knows more than this, because here I only use the embedding layer. Still, it shows there are very clear signals about a token's characters in its embedding vector.
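Roughly, the probe looks like this (a simplified sketch rather than the exact notebook; it assumes the transformers and scikit-learn packages and the Qwen/Qwen3-8B checkpoint on Hugging Face):

    import numpy as np
    from transformers import AutoTokenizer, AutoModelForCausalLM
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    name = "Qwen/Qwen3-8B"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    emb = model.get_input_embeddings().weight.detach().float().numpy()

    # Pair each purely alphabetic token's embedding with its 1st character.
    X, y = [], []
    for tid in range(emb.shape[0]):
        s = tok.decode([tid]).strip()
        if s.isalpha():
            X.append(emb[tid])
            y.append(s[0].lower())

    X_tr, X_te, y_tr, y_te = train_test_split(np.array(X), y, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    # Held-out accuracy; repeat with s[1] / s[2] for the 2nd and 3rd characters.
    print(clf.score(X_te, y_te))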
> the word "strawberry" is a single token, and that single token is what the model gets as input.
This is incorrect.
"strawberry" is actually 4 tokens (at least for GPT, but most LLMs are similar).
See https://platform.openai.com/tokenizer
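You can also check it programmatically (a sketch assuming the tiktoken package; the split differs between encodings, which is probably why people report different counts):

    import tiktoken

    for enc_name in ("cl100k_base", "o200k_base"):
        enc = tiktoken.get_encoding(enc_name)
        ids = enc.encode("strawberry")
        print(enc_name, ids, [enc.decode([i]) for i in ids])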
I got 3 tokens: st, raw, and berry. My point still stands: processing "berry" as a single token does not let the model learn its spelling directly, the way human readers do. It still has to rely on an explicit mapping of the word "berry" to b e r r y spelled out in some text in the training dataset. If that mapping is not present in the training data, the model cannot learn the spelling - in principle.
Exactly. If “st” is 123, “raw” is 456, “berry” is 789, and “r” is 17… it makes little sense to ask the model to count the [17]’s in [123, 456, 789]: it demands an awareness of the abstraction that does not exist.
To the extent the knowledge is there, it’s from data in the training corpus, not from direct examination of the text or tokens in the prompt.
So much for generalized intelligence, I guess.