Comment by drewnick

1 year ago

None of the 3 models you ranked can get "how many r's are in strawberry?" correct. They all claim 2 r's unless you press them. With all the training data out there, I'm surprised none of them has fixed this yet.

Tokenization makes it hard for the model to count letters; that's also why, if you ask it to do maths, writing the numbers out in words will yield better results.

For strawberry, it sees [496, 675, 15717], which is str aw berry.

If you insert characters to break the tokens apart, it finds the correct result: how many r's are in "s"t"r"a"w"b"e"r"r"y" ?

> There are 3 'r's in "s"t"r"a"w"b"e"r"r"y".
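
You can check the split yourself; here's a rough sketch using the `tiktoken` package (an assumption on my part; the exact ids depend on which tokenizer the model uses, so they may not match the ones above):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # a BPE encoding used by several OpenAI models
    for word in ['strawberry', ' strawberry', 's"t"r"a"w"b"e"r"r"y']:
        ids = enc.encode(word)
        # Print the ids and the sub-word pieces they decode to -- pieces, not letters.
        print(word, ids, [enc.decode([i]) for i in ids])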

  • > If you insert characters to break the tokens apart, it finds the correct result: how many r's are in "s"t"r"a"w"b"e"r"r"y" ?

    The issue is that humans don't talk like this. I don't ask someone how many r's there are in strawberry by spelling out strawberry, I just say the word.

    • Humans also constantly make mistakes that are due to proximity in their internal representation. "Could of"/"Should of" comes to mind: the letters "of" have a large edit distance from "'ve", but their pronunciation is very similar.

      Native speakers are especially prone to this mistake, as they grew up learning English as illiterate children, from sounds alone, whereas most people who learn English as a second language learn it together with its textual representation.

      Psychologists use this trick as well to probe internal representations, for example with the Rorschach test.

      And probably, if you asked random people in the street how many p's there are in "Philippines", you'd also get lots of wrong answers. It's tricky due to the double p and the initial p being part of an f sound. The demonym uses "F" as its first letter, and in many languages, Spanish for example, the country name also uses an F.

      6 replies →

    • > I don't ask someone how many r's there are in strawberry by spelling out strawberry, I just say the word.

      No, I would actually be pretty confident you don’t ask people that question… at all. When is the last time you asked a human that question?

      I can’t remember ever having anyone in real life ask me how many r’s are in strawberry. A lot of humans would probably refuse to answer such an off-the-wall and useless question, thus “failing” the test entirely.

      A useless benchmark is useless.

      In real life, people overwhelmingly do not need LLMs to count occurrences of a certain letter in a word.

      1 reply →

    • Count the number of occurrences of the letter e in the word "enterprise".

      Problems can exist as instances of a class of problems. If you can't solve a problem, it's useful to know whether it's a one-off or whether it belongs to a larger class of problems, and which class it belongs to. In this case, the strawberry problem belongs to the much larger class of tokenization problems - if you think you've solved the tokenization problem class, you can test a model on the strawberry problem, plus a few other examples from the class at large, and be confident that you've solved the class generally.

      It's not about embodied human constraints or how humans do things; it's about what AI can and can't do. Right now, because of tokenization, things like understanding the number of Es in strawberry are outside the implicit model of the word in the LLM, with downstream effects on tasks it can complete. This affects moderation, parsing, generating prose, and all sorts of unexpected tasks. Having a workaround like forcing the model to insert spaces and operate on explicitly delimited text is useful when affected tasks appear.
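
      The workaround can be as small as a helper that spells the word out before the question reaches the model (a sketch; the delimiter and prompt wording are arbitrary choices of mine):

          def spell_out(word: str, sep: str = " ") -> str:
              # Spelling the word out usually makes the tokenizer emit one piece per character.
              return sep.join(word.upper())

          prompt = f'How many R\'s are in "{spell_out("strawberry")}"?'
          print(prompt)  # How many R's are in "S T R A W B E R R Y"?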

    • Humans also would probably be very likely to guess 2 r's if they had never seen any written words or never had the word spelled out to them letter by letter, which is roughly how language models treat it, despite presenting a textual interface.

    • > Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.

      We are also not exactly looking letter by letter at everything we read.

      6 replies →

    • Humans do chain-of-thought.

      User: Write “strawberry” one letter at a time, with a space between each letter. Then count how many r’s are in strawberry.

      gpt-3.5-turbo: ASSISTANT s t r a w b e r r y

      There are 2 r's in strawberry.

      After some experimenting, it seems like the actual problem is that many LLMs can’t count.

      User: How many r’s are in the following sequence of letters:

      S/T/R/A/W/B/E/R/R/Y

      gpt-4o-mini: In the sequence S/T/R/A/W/B/E/R/R/Y, there are 2 occurrences of the letter "R."

      Oddly, if I change a bunch of the non-R letters, I seem to start getting the right answer.

    • >I don't ask someone how many r's there are in strawberry by spelling out strawberry, I just say the word.

      You don't ask a human being how many r's there are in strawberry at all. The only reason you or anyone else asks that question is because it's an interesting quirk of how LLMs work that they struggle to answer it in that format. It's like an alien repeatedly showing humans an optical illusion that relies on the existence of our (literal) blind spot and using it as evidence of our supposed lack of intelligence.

  • Where did you get this idea from?

    Tokens aren't the source of facts within a model. They're an implementation detail and don't inherently constrain how things could be counted.

    • Tokens are the first form of information being encoded into the model. They're statistically guided, more or less a compression dictionary comparable to a Lempel Ziv setup.

      Combinations of tokens get encoded, so if a feature isn't part of the information being carried forward into the network as it models the corpus, that feature isn't modeled well, or at all. The consequence of multi-character tokens is that the relevance of individual characters is lost, and you have to elicit that information explicitly. Models know that words have individual characters, but "strawberry" isn't encoded as a sequence of letters; it's encoded as an individual feature of the tokenizer embedding.

      Other forms of tokenizing have other tradeoffs. The trend lately is to increase tokenizer dictionary scope, up to 128k in Llama3 from 50k in gpt-3. The more tokens, the more nuanced individual embedding features in that layer can be before downstream modeling.

      Tokens inherently constrain how the notion of individual letters is modeled in the context of everything an LLM learns. In the vast majority of cases, the letters don't matter, so those features don't get mapped and carried downstream of the tokenizer.
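
      To make the "individual feature" point concrete, here is a minimal PyTorch sketch (the vocabulary and embedding sizes are assumptions, roughly Llama-3-like, and the token id is hypothetical):

          import torch

          vocab_size, d_model = 128_256, 4_096      # assumed, Llama-3-ish sizes
          embedding = torch.nn.Embedding(vocab_size, d_model)

          strawberry_id = torch.tensor([12_345])    # hypothetical id for a " strawberry"-like token
          vec = embedding(strawberry_id)            # shape [1, 4096]
          # Downstream layers only ever see `vec`; nothing in it spells out s-t-r-a-w-b-e-r-r-y.
          print(vec.shape)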

      1 reply →

    • There isn't much in the training data that teaches the AI which letters are in each token. It's a made-up concept, and since the AI doesn't have enough information about it in the dataset, it has difficulty generalizing it.

      There are a lot of problems like that which can be reformulated. For example, if you ask it which is bigger, 9.11 or 9.9, it will often get it wrong. If you look at how the numbers are tokenized, you can see that it turns an easy problem into something that isn't straightforward even for a human. If you restate the problem by writing the numbers out in words, it will respond correctly.

      1 reply →

Testing models on their tokenization has always struck me as kinda odd. Like, that has nothing to do with their intelligence.

  • It's like showing someone a color and asking how many letters it has. 4... 3? Blau, blue, azul, blu: the color holds the meaning, and the words all map back to it.

    In the model, individual letters hold little meaning. Words are composed of letters simply because we need some organized structure for communication that helps represent meaning and intent, just like our color blue/blau/azul/blu.

    Not faulting them for asking the question but I agree that the results do not undermine the capability of the technology. In fact it just helps highlight the constraints and need for education.

  • How is a layman supposed to even know that it's testing on that? All they know is it's a large language model. It's not unreasonable they should expect it to be good at things having to do with language, like how many letters are in a word.

    Seems to me like a legit question for a young child to answer or even ask.

    • > How is a layman supposed to even know that it's testing on that?

      They're not, but laymen shouldn't think that the LLM tests they come up with have much value.

      3 replies →

  • It doesn’t test “on tokenization” though. What happens when an answer is generated is a few abstraction levels deeper than tokens. A “thinking” “slice” of an LLM is completely unaware of tokens as an immediate part of its reasoning. The question just shows a lack of systemic knowledge about strawberry as a word (which isn’t surprising, tbh).

  • I hear this a lot, but vast sums of money are thrown at cases where a model fails the way it does on strawberry.

    Think about math and logic. If a single symbol is off, it’s no good.

    At my work, a prompt that produces a single tokenization error generates, by my very rough estimate, 2 man-hours of work. (We search for incorrect model responses, get the models to correct themselves, and if they can’t after trying, we tell them the right answer and edit it for perfection.) Yes, even for counting occurrences of characters. Think about how applicable that is: finding the next term in a sequence, analyzing strings, etc.

    • > Think about math and logic. If a single symbol is off, it’s no good.

      In that case the tokenization is done at the appropriate level.

      This is a complete non-issue for the use cases these models are designed for.

      1 reply →

  • > that has nothing to do with their intelligence.

    Of course. Because these models have no intelligence.

    Everyone who believes they do seems to believe intelligence derives from being able to use language, however, and not being able to tell how many times the letter r appears in the word strawberry is a very low bar to fail.

  • Surfacing and underscoring obvious failure cases for general "helpful chatbot" use is always going to be valuable because it highlights how the "helpful chatbot" product is not really intuitively robust.

    Meanwhile, it helps make sure engineers and product designers who want to build a more targeted product around LLM technology know that it's not suited to tasks that may trigger those kinds of failures. This may be obvious to you as an engaged enthusiast or cutting edge engineer or whatever you are, but it's always going to be new information to somebody as the field grows.

  • I don’t know anything about LLMs beyond using ChatGPT and Copilot… but unless this lack of knowledge is making me misinterpret your reply, it sounds as if you are excusing the model giving a completely wrong answer to a question that anyone intelligent enough to learn the alphabet can answer correctly.

    • The problem is that the model never gets to see individual letters. The tokenizers used by these models break up the input in pieces. Even though the smallest pieces/units are bytes in most encodings (e.g. BBPE), the tokenizer will cut up most of the input in much larger units, because the vocabulary will contain fragments of words or even whole words.

      For example, take the sentence "Welcome to Hacker News, I hope you like strawberries." The Llama 405B tokenizer will tokenize this as:

          Welcome Ġto ĠHacker ĠNews , ĠI Ġhope Ġyou Ġlike Ġstrawberries .
      

      (Ġ means that the token was preceded by a space.)

      Each of these pieces is looked up by its index, and the text is encoded as a tensor of those indices. Adding a special token for the beginning of the text, this gives:

          [128000, 14262, 311, 89165, 5513, 11, 358, 3987, 499, 1093, 76203, 13]
      

      So, all the model sees for 'Ġstrawberries' is the number 76203 (which is then used in the piece embedding lookup). The model does not even have access to the individual letters of the word.

      Of course, one could argue that the model should be fed with bytes or codepoints instead, but that would make them vastly less efficient with quadratic attention. Though machine learning models have done this in the past and may do this again in the future.

      Just wanted to finish off this comment by saying that a word may be provided to the model split into several pieces if the word itself is not in the vocabulary. For instance, the same sentence translated to my native language is tokenized as:

          Wel kom Ġop ĠHacker ĠNews , Ġik Ġhoop Ġdat Ġje Ġvan Ġa ard be ien Ġh oud t .
      

      And the word for strawberries (aardbeien) is split, though still not into letters.
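
      If you want to reproduce this, here's a rough sketch with the `transformers` library (the checkpoint name is an assumption, and the official Llama tokenizers are gated; any Llama-3-family tokenizer shows the same Ġ-style pieces):

          from transformers import AutoTokenizer

          # Assumed (gated) checkpoint; substitute any Llama-3-family tokenizer you have access to.
          tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-405B")

          text = "Welcome to Hacker News, I hope you like strawberries."
          print(tok.tokenize(text))  # the Ġ-prefixed pieces shown above
          print(tok.encode(text))    # the indices, including the begin-of-text token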

      12 replies →

  • How can I know whether any particular question will test a model on its tokenization? If a model makes a boneheaded error, how can I know whether it was due to lack of intelligence or due to tokenization? I think finding places where models are surprisingly dumb is often more informative than finding particular instances where they seem clever.

    It's also funny, since this strawberry question is one where a model that's seriously good at predicting the next character/token/whatever quanta of information would get it right. It requires no reasoning, and is unlikely to have any contradicting text in the training corpus.

    • > How can I know whether any particular question will test a model on its tokenization?

      Does something deal with separate symbols rather than just meaning of words? Then yes.

      This affects spelling, math (value calculation), logic puzzles based on symbols. (You'll have more success with a puzzle about "A B A" rather than "ABA")

      > It requires no reasoning, and is unlikely to have any contradicting text in the training corpus.

      This thread itself contains contradicting text. Every other LLM announcement has comments with contradicting text, where people post the models' wrong responses.

  • I suppose what models should have are some instructions about the things they aren’t good at and will need to break out into Python code or what have you. Humans have an intuition for this: I have a basic sense of when I need to write something down or use a calculator. LLMs don’t have that intuition (yet; though I suppose one could use a smaller model for it), so explicit instructions would work for now.

  • It's not very interesting when they fail at it, but it will be interesting if they get good at it.

    Also there are some cases where regular people will stumble into it being awful at this without any understanding why (like asking it to help them with their wordle game.)

  • Call me when models understand when to convert the token into actual letters and count them. Can’t claim they’re more than word calculators before that.

    • That's misleading.

      When you read and comprehend text, you don't read it letter by letter, unless you have a severe reading disability. Your ability to comprehend text works more like an LLM.

      Essentially, you can compare the human brain to a multi-model or modular system. There are layers or modules involved in most complex tasks. When reading, you recognize multiple letters at a time[], and those letters are essentially assembled into tokens that a different part of your brain can deal with.

      Breaking down words into letters is essentially a separate "algorithm". Just like your brain, it's likely to never make sense for a text comprehension and generation model to operate at the level of letters - it's inefficient.

      A multi-modal model with a dedicated model for handling individual letters could easily convert tokens into letters and operate on them when needed. It's just not a high priority for most use cases currently.

      []https://www.researchgate.net/publication/47621684_Letters_in...

      2 replies →

    • The model communicates in a language, but our letters are not necessary for that and are in fact not part of the English language itself. You could write English using per-word pictographs and it would still be the same English and carry the same information/message. It's like asking you whether there is a '5' in 256 when you read binary.

    • Is anyone in the know, aside from mainstream media (god forgive me for using this term unironically) and civilians on social media, claiming LLMs are anything but word calculators?

      I think that's a perfect description by the way, I'm going to steal it.

      8 replies →

  • > Like, that has nothing to do with their intelligence.

    Because they don't have intelligence.

    If they did, they could count the letters in strawberry.

    • People have been over this. If you believe this, you don't understand how LLMs work.

      They fundamentally perceive the world in terms of tokens, not "letters".

      1 reply →

  • I would counterargue with "that's the model's problem, not mine".

    Here's a thought experiment: if I gave you 5 boxes and asked you "how many balls are there in all of these boxes?" and you answered "I don't know because they are inside boxes", that's a fail. A truly intelligent individual would open them and look inside.

    A truly intelligent model would (say) retokenize the word into its individual letters (which I'm optimistic they can) and then would count those. The fact that models cannot do this is proof that they lack some basic building blocks for intelligence. Model designers don't get to argue "we are human-like except in the tasks where we are not".

    • Of course they lack building blocks for full intelligence. They are good at certain tasks, and counting letters is emphatically not one of them. They should be tested and compared on the kinds of tasks they're fit for, and thus the kinds of tasks they will be used to solve, not tasks for which they would be misemployed to begin with.

      3 replies →

    • It just needs a little hint

          Me: spell "strawberry" with 1 bullet point per letter
          ChatGPT:
             S
             T
             R
             A
             W
             B
             E
             R
             R
             Y
          Me: How many Rs? 
          ChatGPT: There are three Rs in "strawberry".

      2 replies →

    • Those who develop AI and know anything don't actually describe the current technology as human-like intelligence; rather, they say it is capable of many tasks which previously required human intelligence.

When using a prompt that involves thinking first, all three get it correct.

"Count how many rs are in the word strawberry. First, list each letter and indicate whether it's an r and tally as you go, and then give a count at the end."

Llama 405b: correct

Mistral Large 2: correct

Claude 3.5 Sonnet: correct

  • It’s not impressive that one has to go to that length though.

    • Imo it's impressive that any of this even remotely works, especially when you consider all the hacks like tokenization that I'd assume add layers of obfuscation.

      There are definitely tons of weaknesses with LLMs for sure, but I continue to be impressed by what they do right, not upset at what they do wrong.

    • > In a park people come across a man playing chess against a dog. They are astonished and say: "What a clever dog!" But the man protests: "No, no, he isn't that clever. I'm leading by three games to one!"

    • To me it's just a limitation based on the world as seen by these models. They know there's a letter called 'r', they even know that some words start with 'r' or have r's in them, and they know what the spelling of some words is. But they've never actually seen one, as their world is made up entirely of tokens. The word 'red' isn't r-e-d but is instead like a pictogram to them. But they know the spelling of strawberry and can identify an 'r' when it's on its own, and they can count those despite not being able to see the r's in the word itself.

      2 replies →

    • To be fair, I just asked a real person and had to go to even greater lengths:

      Me: How many "r"s are in strawberry?

      Them: What?

      Me: How many times does the letter "r" appear in the word "strawberry"?

      Them: Is this some kind of trick question?

      Me: No. Just literally, can you count the "r"s?

      Them: Uh, one, two, three. Is that right?

      Me: Yeah.

      Them: Why are you asking me this?

      3 replies →

  • Appending "Think step-by-step" is enough to fix it for both Sonnet and LLama 3.1 70B.

    For example, the latter model answered with:

    To count the number of Rs in the word "strawberry", I'll break it down step by step:

    Start with the individual letters: S-T-R-A-W-B-E-R-R-Y

    Identify the letters that are "R": R (first one), R (second one), and R (third one)

    Count the total number of Rs: 1 + 1 + 1 = 3

    There are 3 Rs in the word "strawberry".

  • Chain-of-Thought (CoT) prompting to the rescue!

    We should always put some effort into prompt engineering before dismissing the potential of generative AI.

  • Can’t you just instruct your llm of choice to transform your prompts like this for you? Basically feed it with a bunch of heuristics that will help it better understand the thing you tell it.

    Maybe the various chat interfaces already do this behind the scenes?

Lots of replies mention tokens as the root cause and I’m not well versed in this stuff at the low level but to me the answer is simple:

When this question is asked (from what the models trained on) the question is NOT “count the number of times r appears in the word strawberry” but instead (effectively) “I’ve written ‘strawbe’, now how many r’s are in strawberry again? Is it 1 or 2?”.

I think most humans would probably answer “there are 2” if we saw someone was writing and they asked that question, even without seeing what they have written down. Especially if someone said “does strawberry have 1 or 2 r’s in it?”. You could be a jerk and say “it actually has 3” or answer the question they are actually asking.

It’s an answer that is _technically_ incorrect but the answer people want in reality.

LLMs think in tokens, not letters. It's like asking someone who is dyslexic about spelling. Not their strong suit. In practice, it doesn't matter much, does it?

  • Sometimes it does, sometimes it doesn't.

    It is evidence that LLMs aren't appropriate for everything, and that there could exist something that works better for some tasks.

    • Language models are best treated like consciousness. Our consciousness does a lot less than people like to attribute to it. It is mostly a function of introspection and making connections, rather than being the seat of higher-level reasoning or of the functions of the brain that keep your body alive (like beating your heart).

      By allowing a language model to do function calling, you are essentially allowing it to do specialized "subconscious" thought. The language model becomes a natural language interface to the capabilities of its "subconsciousness".

      A specific human analogy could be: I tell you to pick up a pen off of the table, and then you do it. Most of your mental activity would be subconscious: orienting your arm and hand properly, actually grabbing the pen, and picking it up. The linguistic representation of the action would exist in your conscious mind (pick up the pen), but not much else.

      A language model could very easily call out to a text-processing function to correctly do things like count the number of r's in the word strawberry. That is a job your conscious mind can dispatch to your subconsciousness.

      2 replies →

  • This doesn't make any sense. The LLM knows the token to letter associations if you ask it. Once it has split the token into letters, it can answer the question anyway.

    This tells us that the model is unaware of its own limitations; otherwise it would have figured out that this task requires it to split out the letters first.

    • Well, splitting into letters doesn't help with Mistral Large 2. Here's what I tried:

      [Question: How many r's in strawberry? First split the word into letters, one letter on each line then answer the question.]

      [Answer: Let's split the word "strawberry" into letters:

      s t r a w b e r r y

      Now, let's count the number of 'r's:

      There are 2 'r's in the word "strawberry".]

      [Question: Did you count the number of r's?]

      [answer: Yes, I did. Here's the count:

      r r

      So, there are 2 'r's in the word "strawberry".]

      I suspect the issue is these models have to be trained to accomplish tasks, and this is not a task it makes sense to train a model to perform. You might train it to do this with function calling one day, calling a python library to count the r's in a string, but actually manually counting r's just doesn't play to an LLM's strengths.
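
      For what it's worth, the tool side of that is trivial; the sketch below is generic (the JSON-schema-style tool description is hypothetical, not any particular vendor's API):

          def count_letter(word: str, letter: str) -> int:
              # The part a plain code path does trivially and an LLM guesses at.
              return word.lower().count(letter.lower())

          # A hypothetical tool description in the JSON-schema style most
          # function-calling APIs expect.
          count_letter_tool = {
              "name": "count_letter",
              "description": "Count how many times a letter appears in a word.",
              "parameters": {
                  "type": "object",
                  "properties": {
                      "word": {"type": "string"},
                      "letter": {"type": "string"},
                  },
                  "required": ["word", "letter"],
              },
          }

          print(count_letter("strawberry", "r"))  # 3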

    • Does it? It knows how a word in some context, represented by tokens, splits into letters. It may or may not know the table. I wouldn't trust what it says about word/token correspondence any more than what it says in general.

Due to the fact that LLMs work on tokens and not characters, these sorts of questions will always be hard for them.

I really wanted to believe this could simply be fixed by more prompt engineering and chain of thought, so I tried it with Claude 3.5 and … well, unfortunately, still no luck. The reply is amusing; I’ll include it verbatim below.

Of note, this happened because earlier I had asked it without CoT and it incorrectly stated two “r”s. Once I started a new conversation (so a fresh context window), it gave the right answer (3) and no longer made the “still the second r” claim shown below.

Let's think through this step-by-step:

1. First, let's spell out "strawberry" letter by letter: s-t-r-a-w-b-e-r-r-y

2. Now, let’s count the “r”s as we go through each letter:

    s (0)
    t (0)
    r (1) - First “r” found
    a (1)
    w (1)
    b (1)
    e (1)
    r (2) - Second “r” found
    r (2) - This is still the second “r”
    y (2)

3. After going through all the letters, we've counted 2 "r"s.

Therefore, the word "strawberry" contains 2 "r"s.

  • Works for me with Claude 3.5:

    Me: How many Rs are in the word strawberry? Think step by step

    Claude: Let's approach this step-by-step:

    1. First, let's write out the word: strawberry

    2. Now, let’s go through each letter one by one:

        s - not an R
        t - not an R
        r - this is an R (count: 1)
        a - not an R
        w - not an R
        b - not an R
        e - not an R
        r - this is another R (count: 2)
        r - this is a third R (count: 3)
        y - not an R

    3. After going through all the letters, we've counted 3 Rs.

    Therefore, there are 3 Rs in the word "strawberry".

    • Yeah, I figured out that it didn’t work at first because I still had the previous attempt (where it got it wrong) in my conversation history. Starting with a fresh conversation gave me the correct answer. It was still funny to see it “rationalize” the wrong answer tho.

  • For these classes of problems that LLMs struggle with, a more reliable way to go about them seems to be to ask them to solve them using tools, e.g. writing and executing a Python script to count the "R"s.
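
    The script the model has to write is tiny; roughly something like this (a sketch of what the interpreter tool would end up executing):

        word = "strawberry"
        print(word.lower().count("r"))  # prints 3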

Interestingly enough much simpler models can write an accurate function to give you the answer.

I think it will be a while before we get there. An LLM can look up knowledge but can't actually perform calculations itself without some external processor.

  • Why do we have to "get there?" Humans use calculators all the time, so why not have every LLM hooked up to a calculator or code interpreter as a tool to use in these exact situations?

I just tried Llama 3.1 8B; this is its reply:

According to multiple sources, including linguistic analysis and word breakdowns, there are 3 Rs in the word "strawberry".

How much do threads like this provide the training data to convince future generations that—despite all appearances to the contrary—strawberry is in fact spelled with only two R's?

I just researched "how many r's are in strawberry?" in a search engine, and based solely on the results it found, I would have to conclude there is substantial disagreement on whether the correct answer is two or three.

  • Speaking as a 100% human, my vote goes to the compromise position that "strawberry" has in fact four Rs.

Does anyone have input on the feasibility of running an LLM locally and providing an interface to some language runtime and storage space, possibly via a virtual machine or container?

No idea if there's any sense to this, but an LLM could be instructed to formulate and continually test mathematical assumptions by writing / running code and fine-tuning accordingly.
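
One rough sketch of the execution side, using only the standard library (the model call is left as a stub; `ask_llm` is a hypothetical wrapper around whatever local interface you run, and a real setup would add a container/VM boundary on top):

    import subprocess, sys, tempfile

    def run_untrusted(code: str, timeout: int = 5) -> str:
        # Write model-generated code to a temp file and run it in a separate
        # process with a timeout; feed stdout/stderr back to the model.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        result = subprocess.run([sys.executable, path],
                                capture_output=True, text=True, timeout=timeout)
        return result.stdout + result.stderr

    # code = ask_llm("Write Python that tests your assumption and prints the result.")
    code = 'print("strawberry".count("r"))'  # stand-in for model-written code
    print(run_untrusted(code))               # output goes back into the next prompt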

I'm not surprised, because it's an issue with the fundamental design of the "pick words that tend to flow after the other words" machine.

Training data will only "fix" it in the shallow sense that it will have seen a comment like yours before. (As opposed to the deeper sense of "learning to count.")

Your question sounds like you want to know how the word is spelled, and no one would put two r's in "straw", so the model could be assuming that you're asking whether it's strawbery or strawberry.

What happens if you ask the total number of occurrences of the letter r in the word? Does it still not get it right?

Tokenization can be bypassed like this (Mistral Large 2):

How many letters R are in the word "s-t-r-a-w-b-e-r-r-y"?

The word "s-t-r-a-w-b-e-r-r-y" contains three instances of the letter "R."

How many letters R contain the word strawberry?

The word "strawberry" contains two instances of the letter "R."

Plug in a code interpreter as a tool and the model will write Python or JavaScript to solve this and get it right 100% of the time. (Full disclosure: I work on a product called Riza that you can use as a code interpreter tool for LLMs)

While strawberry can be attributed to tokenization, here is some other basic stuff I've seen language models fail at:

1. Play tic tac toe such that you never lose

2. Which is bigger 9.11 or 9.9

3. 4 digit multiplication even with CoT prompting

4o will get the answer right on the first go if you ask it "Search the Internet to determine how many R's are in strawberry?" which I find fascinating

  • I didn't even need to do that. 4o got it right straight away with just:

    "how many r's are in strawberry?"

    The funny thing is, I replied, "Are you sure?" and got back, "I apologize for the mistake. There are actually two 'r's in the word strawberry."

    • I kind of tried to replicate your experiment (in German, where “Erdbeere” has 4 E's), and it went the same way. The interesting thing was that after I pointed out the error, I couldn't get it to doubt the result again. It stuck to the correct answer, which seemed kind of “reinforced”.

      It was also interesting to observe how GPT (4o) even tried to prove/illustrate the result typographically by placing the same word four times and putting the respective letter in bold font (without being prompted to do that).

    • GPT-4o-mini consistently gives me this:

      > How many times does the letter “r” appear in the word “strawberry”?

      > The letter "r" appears 2 times in the word "strawberry."

      But also:

      > How many occurrences of the letter “r” appear in the word “strawberry”?

      > The word "strawberry" contains three occurrences of the letter "r."

      1 reply →

> how many r's are in strawberry

How many thoughts go through your brain when you read this comment? You can give me a number but it will be a guess at best.

I mean, yes? Letter-based gymnastics are difficult in a token-based environment, and they're not a particularly valuable use case (outside of online dunking), so it's not really a surprise nobody cares much about those in terms of model training.