Comment by IanCal

2 months ago

The tokenisation means they don’t see the letters at all. They see something like this, with just some of the tokens converted back into words so the question stays readable:

How many 538 do you see in 423, 4144, 9890?
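To make that concrete, here is a minimal sketch using the tiktoken library (the ids in the question above are illustrative; the real ids depend on which encoding you use):

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

question = "How many r do you see in strawberry?"
token_ids = enc.encode(question)
print(token_ids)  # a short list of integers, not characters

# Decoding each id on its own shows the multi-character chunks the
# model actually works with -- "strawberry" is split into a few
# chunks, never into individual letters.
print([enc.decode([t]) for t in token_ids])
```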

LLMs don’t see token ids directly; they see the token embeddings those ids map to, and those embeddings are correlated. The embeddings for 538, 423, 4144, and 9890 (hypothetical ids here) likely end up strongly correlated during training, and the downstream layers of the LLM should be able to exploit those patterns to answer the question correctly. Even more so since training exposes the model to many examples of similarly correlated embeddings when predicting the next token.
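A rough sketch of that id-to-embedding step, purely to show the mechanics (the table below is random and the ids are the hypothetical ones from the example; in a trained model the vectors for related tokens would carry real, learned structure):

```python
import numpy as np

vocab_size, d_model = 50_000, 768
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, d_model))

token_ids = [423, 4144, 9890]          # hypothetical ids from the example
vectors = embedding_table[token_ids]   # what the model actually operates on

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# With a random table these similarities hover around zero; after
# training, tokens that co-occur or share meaning end up with
# correlated vectors, which is the structure the model can exploit.
print(cosine(vectors[0], vectors[1]))
```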

  • But vitally, they are not explicitly shown the letters individually, so “count the letters” is a much harder problem for them than it is for us (see the sketch below).
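A tiny illustration of that gap, again using the hypothetical ids from the example: counting letters is trivial when you can see the characters, but nothing in the token ids exposes the letters they encode.

```python
text = "strawberry"
print(text.count("r"))         # 3 -- easy with direct access to characters

token_ids = [423, 4144, 9890]  # illustrative ids standing in for the word
# Answering "how many r" from these integers alone would require the
# model to have memorised the spelling behind each id.
```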