Comment by generalizations
1 year ago
Testing models on their tokenization has always struck me as kinda odd. Like, that has nothing to do with their intelligence.
It's like showing someone a color and asking how many letters it has. Four... three? Blau, blue, azul, blu: the color holds the meaning, and the words all map back to it.
In the model, the individual letters hold little meaning. Words are composed of letters simply because we need some sort of organized structure for communication that helps represent meaning and intent, just like our color blue/blau/azul/blu.
Not faulting them for asking the question, but I agree that the results do not undermine the capability of the technology. In fact, they just highlight the constraints of the technology and the need for education.
How is a layman supposed to even know that it's testing that? All they know is that it's a large language model. It's not unreasonable for them to expect it to be good at things having to do with language, like how many letters are in a word.
Seems to me like a legit question for a young child to answer or even ask.
> How is a layman supposed to even know that it's testing on that?
They're not, but laymen shouldn't think that the LLM tests they come up with have much value.
I'm saying a layman or say a child wouldn't even think this is a "test". They are just asking a language model a seemingly simple language related question from their point of view.
It doesn’t test “on tokenization” though. What happens when an answer is generated is a few abstraction levels deeper than tokens. A “thinking” “slice” of an LLM is completely unaware of tokens as an immediate part of its reasoning. The question just shows a lack of systemic knowledge about strawberry as a word (which isn’t surprising, tbh).
It is. Strawberry is one token in many tokenizers. The model doesn't have a concept that there are letters there.
If I show you a strawberry and ask how many r’s are in the name of this fruit, you can tell me, because one of the things you know about strawberries is how to spell their name.
Very large language models also “know” how to spell the word associated with the strawberry token, which you can test by asking them to spell the word one letter at a time. If you ask the model to spell the word and count the R’s while it goes, it can do the task. So the failure to do it when asked directly (how many r’s are in strawberry) is pointing to a real weakness in reasoning, where one forward pass of the transformer is not sufficient to retrieve the spelling and also count the R’s.
The thinking part of a model doesn’t know about tokens either, just as a regular human a few thousand years ago didn’t think of neural impulses or air pressure distribution when talking. It might “know” about tokens and letters the way you know about neurons and sound, but it doesn't access them on the technical level, which is completely isolated from it. The fact that it’s a chat of tokens of letters, which are a form of information passing between humans, is accidental.
If I ask an LLM to generate new words for some concept or category, it can do that. How do the new words form, if not from joining letters?
This is pretty much equivalent to the statement "multicharacter tokens are a dead end for understanding text". Which I agree with.
I hear this a lot, but there are vast sums of money thrown at the places where a model fails strawberry-style cases.
Think about math and logic. If a single symbol is off, it’s no good.
For example, at my work, a prompt where we can generate a single tokenization error produces, by my very rough estimate, two man-hours of work. (We search for incorrect model responses, get the model to correct itself, and if it can't after trying, we tell it the right answer and edit the response for perfection.) Yes, even for counting occurrences of characters. Think about how applicable that is: finding the next term in a sequence, analyzing strings, etc.
> Think about math and logic. If a single symbol is off, it’s no good.
In that case the tokenization is done at the appropriate level.
This is a complete non-issue for the use cases these models are designed for.
But we don’t restrict it to math or logical syntax; it's any prompt across essentially all domains. The same model is expected to handle any kind of logical reasoning that can be brought into text. We don’t mark it incorrect if it spells an unimportant word wrong, but keep in mind the spelling of a word can be important for many questions. For example, off the top of my head: please concatenate “d”, “e”, “a”, “r” into a common English word without rearranging the order. The types of examples are endless, and any type of example it gets wrong, we want to correct. I’m not saying most models will fail this specific example; it’s just to show the breadth of expectations.
> that has nothing to do with their intelligence.
Of course. Because these models have no intelligence.
Everyone who believes they do seems to believe that intelligence derives from being able to use language, however, and not being able to tell how many times the letter r appears in the word strawberry is a very low bar to fail.
An LLM trained on single-letter tokens would be able to; it would just be much more laborious to train.
Why would it be able to?
Surfacing and underscoring obvious failure cases for general "helpful chatbot" use is always going to be valuable because it highlights how the "helpful chatbot" product is not really intuitively robust.
Meanwhile, it helps make sure engineers and product designers who want to build a more targeted product around LLM technology know that it's not suited to tasks that may trigger those kinds of failures. This may be obvious to you as an engaged enthusiast or cutting edge engineer or whatever you are, but it's always going to be new information to somebody as the field grows.
I don’t know anything about LLMs beyond using ChatGPT and Copilot… but unless this lack of knowledge is making me misinterpret your reply, it sounds as if you are excusing the model for giving a completely wrong answer to a question that anyone intelligent enough to learn the alphabet can answer correctly.
The problem is that the model never gets to see individual letters. The tokenizers used by these models break the input up into pieces. Even though the smallest pieces/units are bytes in most encodings (e.g. BBPE), the tokenizer will cut most of the input into much larger units, because the vocabulary contains fragments of words or even whole words.
For example, if we tokenize "Welcome to Hacker News, I hope you like strawberries.", the Llama 405B tokenizer cuts the sentence into word-sized pieces, one of which is 'Ġstrawberries' (Ġ means that the token was preceded by a space).
Each of these pieces is looked up in the vocabulary and encoded by its index, and special tokens are added for the beginning and end of the text, so the model receives only a sequence of integers.
So, all the model sees for 'Ġstrawberries' is the number 76204 (which is then used in the piece embedding lookup). The model does not even have access to the individual letters of the word.
Of course, one could argue that the model should be fed with bytes or codepoints instead, but that would make them vastly less efficient with quadratic attention. Though machine learning models have done this in the past and may do this again in the future.
Just wanted to finish off this comment by saying that a word may reach the model split into several pieces if the word itself is not in the vocabulary. For instance, when the same sentence is translated into my native language, the word for strawberries (aardbeien) is split into multiple pieces, though still not into letters.
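If you want to see this for yourself, here is a minimal sketch using Hugging Face's transformers library. It uses the freely available GPT-2 BPE tokenizer as a stand-in (the exact pieces and IDs differ from the Llama 405B tokenizer discussed above), but the principle is the same: the model receives sub-word pieces and their integer IDs, never individual letters.

```python
# Minimal sketch: inspect what an LLM actually "sees" for a sentence.
# Uses the freely available GPT-2 BPE tokenizer as a stand-in; piece
# boundaries and IDs differ from the Llama 405B tokenizer discussed above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Welcome to Hacker News, I hope you like strawberries."

pieces = tokenizer.tokenize(text)  # sub-word pieces; 'Ġ' marks a leading space
ids = tokenizer.encode(text)       # one integer per piece, used for the embedding lookup

print(pieces)
print(ids)
```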
The thing is, how the tokenizing works is about as relevant to the person asking the question as the name of the cat of the delivery guy who delivered the GPU that the LLM runs on.
How can I know whether any particular question will test a model on its tokenization? If a model makes a boneheaded error, how can I know whether it was due to lack of intelligence or due to tokenization? I think finding places where models are surprisingly dumb is often more informative than finding particular instances where they seem clever.
It's also funny, since this strawberry question is one where a model that's seriously good at predicting the next character/token/whatever quanta of information would get it right. It requires no reasoning, and is unlikely to have any contradicting text in the training corpus.
> How can I know whether any particular question will test a model on its tokenization?
Does something deal with separate symbols rather than just meaning of words? Then yes.
This affects spelling, math (value calculation), and logic puzzles based on symbols. (You'll have more success with a puzzle about "A B A" than one about "ABA".)
> It requires no reasoning, and is unlikely to have any contradicting text in the training corpus.
This thread contains contradictions. Every other announcement of an LLM contains a comment with contradicting text, where people post the wrong responses.
I suppose what models should have is some set of instructions about the things they aren’t good at and will need to break out into Python code or what have you. Humans have an intuition for this: I have a basic sense of when I need to write something down or use a calculator. LLMs don’t have that intuition (yet, though I suppose one could use a smaller model for it), so explicit instructions would work for now.
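As a rough illustration of what "breaking out into Python" could mean here (a sketch of the idea only, not any particular vendor's tool-calling API), the delegated program is trivial:

```python
# The kind of trivial helper an LLM could delegate letter-counting to,
# instead of attempting it in a single forward pass over tokens.
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())

if __name__ == "__main__":
    print(count_letter("strawberry", "r"))  # 3
```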
It's not very interesting when they fail at it, but it will be interesting if they get good at it.
Also, there are some cases where regular people will stumble into it being awful at this without any understanding of why (like asking it to help them with their Wordle game).
Call me when models understand when to convert the token into actual letters and count them. Can’t claim they’re more than word calculators before that.
That's misleading.
When you read and comprehend text, you don't read it letter by letter, unless you have a severe reading disability. Your ability to comprehend text works more like an LLM.
Essentially, you can compare the human brain to a multi-model or modular system. There are layers or modules involved in most complex tasks. When reading, you recognize multiple letters at a time[1], and those letters are essentially assembled into tokens that a different part of your brain can deal with.
Breaking words down into letters is essentially a separate "algorithm". Just as in your brain, it's likely never going to make sense for a text comprehension and generation model to operate at the level of letters; it's inefficient.
A multi-modal model with a dedicated model for handling individual letters could easily convert tokens into letters and operate on them when needed. It's just not a high priority for most use cases currently.
[1] https://www.researchgate.net/publication/47621684_Letters_in...
I agree completely; that wasn’t the point, though. The point was that my 6-year-old knows to spell the word out when asked, and the blob of quantized floats doesn’t, or at least not reliably.
So the blob wasn’t trained to do that (yeah, low utility, I get that), but it also doesn’t know that it doesn’t know, which is another, much bigger and still unsolved problem.
The model communicates in a language, but our letters are not necessary for that, and are in fact not part of the English language itself. You could write English using per-word pictographs and it would still be the same English and carry the same information/message. It's like asking you whether there is a '5' in 256 when you read it in binary.
Is anyone in the know, aside from mainstream media (God forgive me for using this term unironically) and civilians on social media, claiming LLMs are anything but word calculators?
I think that's a perfect description by the way, I'm going to steal it.
I think it's a very poor intuition pump. These 'word calculators' have lots of capabilities not suggested by that term, such as a theory of mind and an understanding of social norms. If they are "merely" a "word calculator", then a "word calculator" is a very odd and counterintuitively powerful algorithm that captures big chunks of genuine cognition.
> Like, that has nothing to do with their intelligence.
Because they don't have intelligence.
If they did, they could count the letters in strawberry.
People have been over this. If you believe this, you don't understand how LLMs work.
They fundamentally perceive the world in terms of tokens, not "letters".
> If you believe this, you don't understand how LLMs work.
Nor do they understand how intelligence works.
Humans don't read text a letter at a time. We're capable of deconstructing words into individual letters, but based on the evidence that's essentially a separate "algorithm".
Multi-model systems could certainly be designed to do that, but just as with the human brain, it's unlikely to ever make sense for a text comprehension and generation model to work at the level of individual letters.
I would counterargue with "that's the model's problem, not mine".
Here's a thought experiment: if I gave you five boxes and asked "how many balls are there in all of these boxes?" and you answered "I don't know because they are inside boxes," that's a fail. A truly intelligent individual would open them and look inside.
A truly intelligent model would (say) retokenize the word into its individual letters (which I'm optimistic they can do) and then count those. The fact that models cannot do this is proof that they lack some basic building blocks of intelligence. Model designers don't get to argue "we are human-like except in the tasks where we are not".
Of course they lack building blocks for full intelligence. They are good at certain tasks, and counting letters is emphatically not one of them. They should be tested and compared on the kind of tasks they're fit for, and so the kind of tasks they will be used in solving, not tasks for which they would be misemployed to begin with.
I agree with you, but that's not what the post claims. From the article:
"A significant effort was also devoted to enhancing the model’s reasoning capabilities. (...) the new Mistral Large 2 is trained to acknowledge when it cannot find solutions or does not have sufficient information to provide a confident answer."
Words like "reasoning capabilities" and "acknowledge when it does not have enough information" have meanings. If Mistral doesn't add footnotes to those assertions then, IMO, they don't get to backtrack when simple examples show the opposite.
It's not like an LLM is released with a hit list of "these are the tasks I really suck at." Right now users have to figure it out on the fly or have a deep understanding of how tokenizers work.
That doesn't even take into account what OpenAI has typically done to intercept queries and cover the shortcomings of LLMs. It would be useful if each model did indeed come out with a chart covering what it cannot do and what it has been tailored to do above and beyond the average LLM.
Ah, so Nick Vujicic[0] would fail your "balls in a box" test, and is not an intelligent entity.
[0]: https://genius-u-attachments.s3.amazonaws.com/uploads/articl...
It just needs a little hint
Me: try again
ChatGPT: There are two Rs in "strawberry."
LLMs are not truly intelligent.
Never have been, never will be. They model language, not intelligence.
We don't know what intelligence is. It's extremely arrogant to say that something or someone doesn't have it, and never will.
They model the dataset they were trained on. What would a dataset of what you consider intelligence look like?
Those who develop AI and know anything don't actually describe the current technology as human-like intelligence; rather, they say it is capable of many tasks which previously required human intelligence.