Comment by nl

4 days ago

> the word "strawberry" is a single token, and that single token is what the model gets as input.

This is incorrect.

"strawberry" is actually 4 tokens (at least for GPT, but most LLMs are similar).

See https://platform.openai.com/tokenizer

I got 3 tokens: "st", "raw", and "berry". My point still stands: processing "berry" as a single token does not let the model learn its spelling directly, the way human readers do. It still has to rely on an explicit mapping of the word "berry" to b e r r y being spelled out in some text in the training data. If that mapping is not present in the training data, the model cannot learn the spelling, even in principle.
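
For anyone who wants to check this themselves, here's a minimal sketch using OpenAI's tiktoken library. The exact split depends on which encoding you pick; cl100k_base is just an assumption here for illustration.

```python
# Minimal sketch: inspect how a BPE tokenizer splits a word.
# Assumes the `tiktoken` package is installed; cl100k_base is the
# GPT-3.5/GPT-4 encoding, and other models may split differently.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("strawberry")
pieces = [enc.decode_single_token_bytes(i).decode("utf-8") for i in ids]
print(ids, pieces)  # a few multi-letter sub-word pieces, not individual letters
```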

  • Exactly. If “st” is 123, “raw” is 456, “berry” is 789, and “r” is 17… it makes little sense to ask the model to count the [17]’s in [123, 456, 789]: it demands an awareness of an abstraction that does not exist.

    To the extent the knowledge is there, it’s from data in the training corpus, not from direct examination of the text or tokens in the prompt.
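
    A quick sketch of that mismatch, again assuming tiktoken's cl100k_base encoding: searching the word's token ids for the letter's token id comes up empty, even though the letter plainly occurs in the word.

    ```python
    # Sketch: the token id for the letter "r" does not appear among the
    # token ids for "strawberry", so "count the r's" has no direct
    # footing at the token level. Assumes tiktoken with cl100k_base.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    word_ids = enc.encode("strawberry")
    r_id = enc.encode("r")[0]
    print(word_ids.count(r_id))  # typically 0: the letter's id is absent
    ```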