Comment by microtonal

1 year ago

The problem is that the model never gets to see individual letters. The tokenizers used by these models break up the input into pieces. Even though the smallest units are bytes in most encodings (e.g. byte-level BPE), the tokenizer will cut up most of the input into much larger units, because the vocabulary contains fragments of words or even whole words.

For example, take the sentence "Welcome to Hacker News, I hope you like strawberries." The Llama 405B tokenizer will tokenize this as:

    Welcome Ġto ĠHacker ĠNews , ĠI Ġhope Ġyou Ġlike Ġstrawberries .

(Ġ means that the token was preceded by a space.)

Each of these pieces is looked up in the vocabulary and encoded as its index, so the input becomes a tensor of indices. Adding a special token for the beginning of the text gives:

    [128000, 14262, 311, 89165, 5513, 11, 358, 3987, 499, 1093, 76203, 13]

So, all the model sees for 'Ġstrawberries' is the number 76203 (which is then used in the embedding lookup for that piece). The model does not even have access to the individual letters of the word.
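If you want to see this yourself, here is a minimal sketch using the Hugging Face transformers library. The repo id is an assumption (the Llama weights are gated); any byte-level BPE tokenizer will show the same effect, though the exact pieces and IDs will differ:

    # Minimal sketch, assuming access to a Llama-3-style tokenizer on the
    # Hugging Face Hub (gated repo; the id below is an assumption, so
    # substitute any BPE tokenizer you have locally).
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-405B")
    text = "Welcome to Hacker News, I hope you like strawberries."

    # The subword pieces the model actually operates on.
    print(tok.tokenize(text))
    # e.g. ['Welcome', 'Ġto', 'ĠHacker', 'ĠNews', ',', 'ĠI', 'Ġhope', ...]

    # The integer IDs fed into the embedding lookup (includes the
    # begin-of-text special token at position 0).
    print(tok(text)["input_ids"])
    # e.g. [128000, 14262, 311, ...]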

Of course, one could argue that the model should be fed bytes or codepoints instead, but with quadratic attention that would make it vastly less efficient. Machine learning models have done this in the past, though, and may do so again in the future.
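To put a rough number on that trade-off: the example sentence above is 11 subword pieces but 53 bytes, and self-attention cost grows with the square of the sequence length. A back-of-the-envelope sketch (the counts come from the example above, not from any measurement):

    # Back-of-the-envelope comparison: subword tokens vs. raw bytes,
    # with the standard O(n^2) self-attention scaling.
    text = "Welcome to Hacker News, I hope you like strawberries."

    n_pieces = 11                          # subword pieces from the example above
    n_bytes = len(text.encode("utf-8"))    # 53 for this all-ASCII sentence

    print(n_bytes)                         # 53
    print((n_bytes / n_pieces) ** 2)       # ~23x more pairwise attention work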

Just wanted to finish off this comment by noting that a word may be split into several tokens if the word itself is not in the vocabulary. For instance, the same sentence translated to my native language is tokenized as:

    Wel kom Ġop ĠHacker ĠNews , Ġik Ġhoop Ġdat Ġje Ġvan Ġa ard be ien Ġh oud t .

And the word for strawberries (aardbeien) is split, though still not into letters.
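If you want to check where a given word ends up, you can inspect the split directly. A small sketch, reusing the `tok` object from the earlier snippet (the leading space matters, since it becomes the Ġ marker):

    # Inspect how individual words are split, assuming `tok` is the
    # tokenizer loaded in the earlier sketch. Common words stay whole;
    # rarer words fall back to subword pieces, not single letters.
    for word in ["strawberries", "aardbeien"]:
        print(word, "->", tok.tokenize(" " + word))
    # e.g. strawberries -> ['Ġstrawberries']
    #      aardbeien    -> ['Ġa', 'ard', 'be', 'ien']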

The thing is, how the tokenizing works is about as relevant to the person asking the question as the name of the cat of the delivery guy who delivered the GPU that the LLM runs on.

  • How the tokenizer works explains why a model can’t answer the question; what the name of the cat is doesn’t explain anything.

    This is Hacker News; we are usually interested in how things work.

    • Indeed, I appreciate the explanation; it is certainly both interesting and informative to me. But to somewhat echo the person you are replying to: if I wanted a boat, and you offer me a boat, and it doesn’t float, then the reasons for the failure may be full of interesting details, but the most important thing to focus on first is to make the boat float, or to stop offering it to people who need a boat.

      To paraphrase how this thread started: someone was testing different boats to see whether they could simply float, and they couldn’t. And the reply questioned the validity of testing whether boats can simply float.

      At least this is how it sounds to me when I am told that our AI overlords can’t figure out how many Rs are in the word “strawberry”.


  • It is, however, a highly relevant thing to be aware of when evaluating an LLM for 'intelligence', which was the context this was brought up in.

    Without looking at the word 'strawberry', or spelling it one letter at a time, can you rattle off how many letters are in the word off the top of your head? No? That is what we are asking the LLM to do.