Comment by wruza

1 year ago

It doesn’t test “on tokenization” though. What happens when an answer is generated is a few abstraction levels deeper than tokens. A “thinking” “slice” of an LLM is completely unaware of tokens as an immediate part of its reasoning. The question just shows a lack of systemic knowledge about strawberry as a word (which isn’t surprising, tbh).

It is. “Strawberry” is a single token in many tokenizers. The model doesn't have a concept that there are letters in there.
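
You can check this directly with a tokenizer library. Here is a minimal sketch using OpenAI's tiktoken (assuming it is installed); the exact split depends on the encoding, so treat the output as illustrative rather than definitive:

```python
# Rough check of how "strawberry" is tokenized; the split varies by encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a GPT-4-era encoding
for text in ["strawberry", " strawberry", "Strawberry"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r} -> {len(ids)} token(s): {pieces}")
```

Whether a given form comes out as one token or a few sub-word pieces, none of them hands the model the individual letters.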

  • If I show you a strawberry and ask how many r’s are in the name of this fruit, you can tell me, because one of the things you know about strawberries is how to spell their name.

    Very large language models also “know” how to spell the word associated with the strawberry token, which you can test by asking them to spell the word one letter at a time. If you ask the model to spell the word and count the R’s as it goes, it can do the task. So the failure when asked directly (“how many r’s are in strawberry”) points to a real weakness in reasoning: one forward pass of the transformer is not enough to both retrieve the spelling and count the R’s.
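
    A rough way to reproduce that comparison is with the openai Python client; the model name below is just a placeholder assumption, and this is a sketch rather than a rigorous test:

    ```python
    # Compare the direct question with a prompt that forces the model
    # to spell the word out before counting.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    direct = "How many r's are in strawberry?"
    spell_first = ("Spell the word 'strawberry' one letter at a time, "
                   "then count how many of those letters are r.")

    for prompt in (direct, spell_first):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # assumption: substitute whatever model you use
            messages=[{"role": "user", "content": prompt}],
        )
        print(prompt, "->", resp.choices[0].message.content)
    ```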

    • Sure, that's a different issue. If you prompt in a way that invokes chain of thought (i.e., what humans would do internally before answering), all of the models I just tested got it right.

    • > If I show you a strawberry and ask how many r’s are in the name of this fruit, you can tell me, because one of the things you know about strawberries is how to spell their name.

      LOL. I would fail your test, because "fraise" only has one R, and you're expecting me to reply "3".

  • The thinking part of a model doesn’t know about tokens either, just as a regular human a few thousand years ago didn’t think of neural impulses or air-pressure distribution when talking. It might “know” about tokens and letters the way you know about neurons and sound, but it doesn’t access them on the technical level, which is completely isolated from it. The fact that it’s a chat of tokens of letters, which are a form of information passing between humans, is accidental.

  • If I ask an LLM to generate new words for some concept or category, it can do that. How are the new words formed, if not by joining letters?

    • The tokenizer system supports virtually any input text that you want, so it follows that it also allows virtually any output text. It isn’t limited to a dictionary of the 1000 most common words or something.

      There are tokens for individual letters, but the model is not trained on text written with one token per letter; it is trained on text that has been converted into as few tokens as possible. Just as you would get very confused if someone started spelling out entire sentences as they spoke to you, expecting you to reconstruct the words from the individual spoken letters, these LLMs would also perform terribly if you sent them one token per letter of input (instead of the tokenizer scheme they were trained on).

      Even though you might write a message to an LLM, it is better to think of that as speaking to the LLM. The LLM is effectively hearing words, not reading letters.
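
      To make that concrete, here is a small sketch with tiktoken (assumed installed); the made-up word is arbitrary and the exact counts depend on the encoding:

      ```python
      # Any text encodes, including invented words; spelling a word out
      # letter by letter produces far more tokens than the compact form.
      import tiktoken

      enc = tiktoken.get_encoding("cl100k_base")
      for text in ["strawberry", "glorptastic", "s t r a w b e r r y"]:
          ids = enc.encode(text)
          pieces = [enc.decode([i]) for i in ids]
          print(f"{text!r} -> {len(ids)} token(s): {pieces}")
      ```

      Whatever the exact numbers, the letter-by-letter version is a very different token stream from anything the model saw in training, which is the “spelling out entire sentences” problem described above.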

  • This is pretty much equivalent to the statement "multicharacter tokens are a dead end for understanding text". Which I agree with.

    • That doesn't follow from what he said at all. Knowing how to spell words and understanding them are basically unrelated tasks.