Comment by Kuinox

1 year ago

Tokenization makes it hard for it to count letters; that's also why, if you ask it to do maths, writing the numbers out in words will yield better results.

For "strawberry", it sees [496, 675, 15717], which is "str" "aw" "berry".

If you insert characters to break the tokens apart, it finds the correct result: how many r's are in "s"t"r"a"w"b"e"r"r"y"?

> There are 3 'r's in "s"t"r"a"w"b"e"r"r"y".
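
A minimal way to see that split for yourself, assuming OpenAI's tiktoken package (the exact IDs depend on which encoding the model uses):

    # Inspect how a BPE tokenizer splits "strawberry" (sketch; assumes tiktoken is installed).
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode("strawberry")
    print(ids)                             # token IDs; the comment above reports [496, 675, 15717]
    print([enc.decode([i]) for i in ids])  # ['str', 'aw', 'berry'] -- no individual letters

    # Quoting each letter forces the tokenizer to split the word apart:
    broken = '"s"t"r"a"w"b"e"r"r"y"'
    print([enc.decode([i]) for i in enc.encode(broken)])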

> If you insert characters to break the tokens apart, it finds the correct result: how many r's are in "s"t"r"a"w"b"e"r"r"y"?

The issue is that humans don't talk like this. I don't ask someone how many r's there are in strawberry by spelling out strawberry, I just say the word.

  • Humans also constantly make mistakes that come from proximity in their internal representation. "Could of"/"Should of" come to mind: the letters "of" have a large edit distance from "'ve", but their pronunciations are very similar.

    Native speakers especially are prone to this mistake, since they grew up learning English from sounds alone, as children who couldn't yet read, whereas most people who learn English as a second language learn it together with its textual representation.

    Psychologists use this trick as well to figure out internal representations, for example with the Rorschach test.

    And if you asked random people in the street how many p's there are in "Philippines", you'd probably also get lots of wrong answers. It's tricky because of the double p and because the initial P is part of a "Ph" pronounced as an F sound. The demonym starts with an F, and in many languages, Spanish for instance, the country name does too.

  • > I don't ask someone how many r's there are in strawberry by spelling out strawberry, I just say the word.

    No, I would actually be pretty confident you don’t ask people that question… at all. When is the last time you asked a human that question?

    I can’t remember ever having anyone in real life ask me how many r’s are in strawberry. A lot of humans would probably refuse to answer such an off-the-wall and useless question, thus “failing” the test entirely.

    A useless benchmark is useless.

    In real life, people overwhelmingly do not need LLMs to count occurrences of a certain letter in a word.

    • AI is claimed to be the same as a human, yet it fails at a task that any human can easily do; that means it isn't human-equivalent, in an easily demonstrable way.

      If full artificial intelligence, as we're being promised, falls short in such a simple way, that is worth pointing out.

  • Count the number of occurrences of the letter e in the word "enterprise".

    Problems can exist as instances of a class of problems. If you can't solve a problem, it's useful to know whether it's a one-off or whether it belongs to a larger class, and which class that is. In this case, the strawberry problem belongs to the much larger class of tokenization problems - if you think you've solved the tokenization problem class, you can test a model on the strawberry problem, along with a few other examples from the class at large, and be confident that you've solved the class generally.

    It's not about embodied human constraints or how humans do things; it's about what AI can and can't do. Right now, because of tokenization, things like understanding the number of Es in strawberry are outside the implicit model of the word in the LLM, with downstream effects on tasks it can complete. This affects moderation, parsing, generating prose, and all sorts of unexpected tasks. Having a workaround like forcing the model to insert spaces and operate on explicitly delimited text is useful when affected tasks appear.
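
    A minimal sketch of that kind of workaround (the spell_out and letter_count_prompt helpers are hypothetical, not from any library):

        # Preprocess a prompt so each letter gets its own token (sketch; helper names are made up).
        def spell_out(word: str) -> str:
            """Rewrite a word as space-separated, quoted letters."""
            return " ".join(f'"{c}"' for c in word)

        def letter_count_prompt(word: str, letter: str) -> str:
            return (
                f"The word {word!r} spelled out is: {spell_out(word)}.\n"
                f"How many occurrences of the letter {letter!r} are in that sequence?"
            )

        print(letter_count_prompt("strawberry", "r"))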

  • Humans would probably also be very likely to guess 2 r's if they had never seen any written words, or never had the word spelled out to them as individual letters, which is fairly close to how language models treat it, despite presenting a textual interface.

  • > Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.

    We are also not exactly looking letter by letter at everything we read.

  • It's not a human. I imagine if you have a use case where counting characters is critical, it would be trivial to programmatically transform prompts into lists of letters.

    A token is roughly four letters [1], so, among other probable regressions, this would significantly reduce the effective context window.

    [1] https://help.openai.com/en/articles/4936856-what-are-tokens-...
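
    A rough way to see that cost, again assuming tiktoken (the roughly-four-characters-per-token figure is OpenAI's rule of thumb; actual counts vary by encoding and input):

        # Compare token counts for normal text vs. the same text split letter by letter.
        import tiktoken

        enc = tiktoken.get_encoding("cl100k_base")
        text = "The quick brown fox jumps over the lazy dog."
        spelled = " ".join(text)  # every character separated by a space

        print(len(enc.encode(text)))     # around 10 tokens for the plain sentence
        print(len(enc.encode(spelled)))  # several times more once each character stands alone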

    • This is the kind of task that you'd just use a bash one liner for, right? LLM is just wrong tool for the job.
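
      (The deterministic check is a one-liner in pretty much any language; in Python, for example:)

          print("strawberry".count("r"))  # prints 3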

  • Humans do chain-of-thought.

    User: Write “strawberry” one letter at a time, with a space between each letter. Then count how many r’s are in strawberry.

    gpt-3.5-turbo: ASSISTANT s t r a w b e r r y

    There are 2 r's in strawberry.

    After some experimenting, it seems like the actual problem is that many LLMs can’t count.

    User: How many r’s are in the following sequence of letters:

    S/T/R/A/W/B/E/R/R/Y

    gpt-4o-mini: In the sequence S/T/R/A/W/B/E/R/R/Y, there are 2 occurrences of the letter "R."

    Oddly, if I change a bunch of the non-R letters, I seem to start getting the right answer.
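
    A sketch of re-running those prompts with the OpenAI Python client (assumes the openai package and an API key; answers will vary by model and run):

        # Reproduce the two letter-counting prompts above (sketch; outputs are not deterministic).
        from openai import OpenAI

        client = OpenAI()
        prompts = [
            'Write "strawberry" one letter at a time, with a space between each letter. '
            "Then count how many r's are in strawberry.",
            "How many r's are in the following sequence of letters:\nS/T/R/A/W/B/E/R/R/Y",
        ]

        for prompt in prompts:
            resp = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": prompt}],
            )
            print(resp.choices[0].message.content)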

  • >I don't ask someone how many r's there are in strawberry by spelling out strawberry, I just say the word.

    You don't ask a human being how many r's there are in strawberry at all. The only reason you or anyone else asks that question is because it's an interesting quirk of how LLMs work that they struggle to answer it in that format. It's like an alien repeatedly showing humans an optical illusion that relies on the existence of our (literal) blind spot and using it as evidence of our supposed lack of intelligence.

  • This is only an issue if you send commands to an LLM as if you were communicating with a human.

    • > This is only an issue if you send commands to an LLM as if you were communicating with a human.

      Yes, it's an issue. We want the convenience of sending human-legible commands to LLMs and getting back human-readable responses. That's the entire value proposition lol.

Where did you get this idea from?

Tokens aren’t the source of facts within a model. Tokenization is an implementation detail and doesn’t inherently constrain how things could be counted.

  • Tokens are the first form in which information gets encoded into the model. They're statistically derived, more or less a compression dictionary comparable to a Lempel-Ziv setup.

    Combinations of tokens get encoded, so if a feature isn't part of the information carried forward into the network as it models the corpus, that feature is modeled poorly, or not at all. The consequence of multi-character tokens is that the relevance of individual characters is lost, and you have to elicit that information explicitly. Models know that words have individual characters, but "strawberry" isn't encoded as a sequence of letters; it's encoded as an individual feature of the tokenizer embedding.

    Other forms of tokenizing have other tradeoffs. The trend lately is to increase the tokenizer dictionary size, up to 128k in Llama 3 from 50k in GPT-3. The more tokens there are, the more nuanced individual embedding features in that layer can be before downstream modeling.

    Tokens inherently constrain how the notion of individual letters is modeled in the context of everything an LLM learns. In the vast majority of cases the letters don't matter, so those features don't get mapped and carried downstream of the tokenizer.
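
    For the GPT-family encodings, those vocabulary sizes can be checked directly with tiktoken (Llama 3's ~128k-entry tokenizer ships with the model itself, not with tiktoken):

        # Vocabulary sizes of a few OpenAI BPE encodings (sketch; needs a recent tiktoken).
        import tiktoken

        for name in ("r50k_base", "cl100k_base", "o200k_base"):
            print(name, tiktoken.get_encoding(name).n_vocab)
        # r50k_base (GPT-3 era) has ~50k entries, cl100k_base ~100k, o200k_base ~200k.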

    • So, it's conjecture then?

      What you're saying sounds plausible, but I don't see how we can conclude that definitively without at least some empirical tests, say a set of words that predictably produce an error along token boundaries.

      The thing is, there are many ways a model can get around to answering the same question; it doesn't just depend on the architecture but also on how the training data is structured.

      For example, if it turned out tokenization was the cause of this glitch, conceivably it could be fixed by adding enough documents with data relating to letter counts, providing another path to get the right output.
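
      A sketch of how such an empirical test set could be built, using tiktoken to check where the counted letter falls relative to token boundaries (the grouping idea here is an assumption, not an established benchmark):

          # Split candidate words into two groups: counted letter hidden inside a
          # multi-character token vs. exposed, then compare model accuracy across the groups.
          import tiktoken

          enc = tiktoken.get_encoding("cl100k_base")

          def letter_hidden(word: str, letter: str) -> bool:
              pieces = [enc.decode([t]) for t in enc.encode(word)]
              return any(letter in p and len(p) > 1 for p in pieces)

          for word in ("strawberry", "enterprise", "raspberry", "mirror"):
              pieces = [enc.decode([t]) for t in enc.encode(word)]
              print(word, pieces, "hidden 'r':", letter_hidden(word, "r"))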

  • There aren't many places in the training data that teach the AI which letters are in each token. It's a made-up concept, and the AI doesn't have enough information about it in the dataset, so it has difficulty generalizing it.

    There are a lot of problems like that which can be reformulated. For example, if you ask it which is bigger, 9.11 or 9.9, it will often get it wrong. If you look at how the numbers are tokenized, you can see that an easy problem gets restated as something that isn't straightforward even for a human; if you restate the problem by writing the numbers out in words, it will respond correctly.
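
    A quick way to see what the model actually receives in the 9.11 vs 9.9 case (another tiktoken sketch; exact splits depend on the encoding):

        # How the decimal comparison looks after tokenization vs. spelled out in words.
        import tiktoken

        enc = tiktoken.get_encoding("cl100k_base")
        for s in ("9.11", "9.9", "nine point one one", "nine point nine"):
            print(s, "->", [enc.decode([t]) for t in enc.encode(s)])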