It can spell the word (writing each letter in uppercase followed by a space, which should turn each letter plus its space into a separate token). It also has reasoning tokens to use as scratch space, and previous models have demonstrated that they know spelling a word out is a useful step in counting its letters.
Tokenization makes the problem difficult, but not solving it is still a reasoning/intelligence issue.
Here's an example of what gpt-oss-20b (at the default mxfp4 precision) does with this question:
> How many "s"es are in the word "Mississippi"?
The "thinking portion" is:
> Count letters: M i s s i s s i p p i -> s appears 4 times? Actually Mississippi has s's: positions 3,4,6,7 = 4.
The answer is:
> The word “Mississippi” contains four letter “s” s.
They can indeed do some simple pattern matching on the query, separate the letters out into separate tokens, and count them without having to do something like run code in a sandbox and ask it for the answer.
The issue here is just that this workaround/strategy is only trained into the "thinking" models, afaict.
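For concreteness, here's a minimal Python sketch of the spell-out-then-tally strategy that reasoning trace is approximating (the word and target letter are just examples):

```python
# Minimal sketch of the "spell it out, then tally" strategy the reasoning
# trace above approximates; the word and target letter are arbitrary examples.
word = "Mississippi"
target = "s"

letters = list(word.upper())  # ['M', 'I', 'S', 'S', 'I', 'S', 'S', 'I', 'P', 'P', 'I']
count = sum(1 for ch in letters if ch == target.upper())

print(" ".join(letters))                                # M I S S I S S I P P I
print(f'"{target}" appears {count} times in "{word}"')  # "s" appears 4 times in "Mississippi"
```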
That proves nothing. The fact that Mississippi has 4 "s"s is far more likely to be in the training data than the fact that blueberry has 2 "b"s.
And now that fact is going to be in the data for the next round of training. We'll need to try some other words on the next model.
I'll be impressed when you can reliably give them a random four-word phrase for this test. Because I don't think anyone is going to try to teach them all those facts; even if they're trained to know letter counts for every English word (as the other comment cites as a possibility), they'd then have to actually count and add, rather than presenting a known answer plus a rationalization that looks like counting and adding (and is easy to come up with once an answer has already been decided).
(Yes, I'm sure an agentic + "reasoning" model can already deduce the strategy of writing and executing a .count() call in Python or whatever. That's missing the point.)
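For reference, the tool-use workaround being waved off here really is a one-liner; a minimal sketch of the sort of snippet an agentic model might generate and execute (the phrase is just an example):

```python
# The trivial sandbox approach: have the interpreter do the counting.
phrase = "correct horse battery staple"  # arbitrary example phrase
print(phrase.count("t"))                 # -> 4
```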
> It also has reasoning tokens to use as scratch space
For GPT 5, it would seem this depends on which model your prompt was routed to.
And GPT 5 Thinking gets it right.
You can even ask it to go letter-by-letter and it'll get the answer right. The information to get it right is definitely in there somewhere; it just doesn't use it by default.
It clearly is an artifact of tokenization, but I don’t think it’s a “just”. The point is precisely that the GPT system architecture cannot reliably close the gap here; it’s almost able to count the number of Bs in a string, there’s no fundamental reason you could not build a correct number-of-Bs mapping for tokens, and indeed it often gets the right answer. But when it doesn’t you can’t always correct it with things like chain of thought reasoning.
This matters because it poses a big problem for the (quite large) category of things where people expect LLMs to be useful when they get just a bit better. Why, for example, should I assume that modern LLMs will ever be able to write reliably secure code? Isn’t it plausible that the difference between secure and almost secure runs into some similar problem?
It's like someone has given a bunch of young people hundreds of billions of dollars to build a product that parses HTML documents with regular expressions.
It's not in their interest to write off the scheme as provably unworkable at scale, so they keep working on the edge cases until their options vest.
> cannot reliably close the gap here
Have you got any proof they're even trying? It's unlikely that's something their real customers are paying for.
I tried to reproduce it again just now, and ChatGPT 5 seems to be a lot more meticulous about running a python script to double-check its work, which it tells me is because it has a warning in its system prompt telling it to. I don't know if that's proof (or even if ChatGPT reliably tells the truth about what's in its system prompt), but given what OpenAI does and doesn't publish it's the closest I could reasonably expect.
Common misconception. That just means the algorithm for counting letters can't be as simple as adding 1 for every token. The number of distinct tokens is tiny compared to the parameter space, and it's not infeasible to store a mapping from token type to character count in those weights.
If you're fine appealing to less concrete ideas, transformers are arbitrary function approximators, tokenization doesn't change that, and there are proofs of those facts.
For any finite-length function (like counting letters in a bounded domain), it's just a matter of having a big enough network and figuring out how to train it correctly. They just haven't bothered.
> The number of distinct tokens is tiny compared to the parameter space, and it's not infeasible to store a mapping from token type to character count in those weights.
You seem to suppose that they actually perform addition internally, rather than simply having a model of the concept that humans sometimes do addition and use it to compute results. Why?
> For any finite-length function (like counting letters in a bounded domain), it's just a matter of having a big enough network and figuring out how to train it correctly. They just haven't bothered.
The problem is that the question space grows exponentially in the length of input. If you want a non-coincidentally-correct answer to "how many t's in 'correct horse battery staple'?" then you need to actually add up the per-token counts.
> You seem to suppose that they actually perform addition internally, rather than simply having a model of the concept that humans sometimes do addition and use it to compute results. Why?
Nothing of the sort. They're _capable_ of doing so. For something as simple as addition you can even hand-craft weights which exactly solve it.
> The problem is that the question space grows exponentially in the length of input. If you want a non-coincidentally-correct answer to "how many t's in 'correct horse battery staple'?" then you need to actually add up the per-token counts.
Yes? The architecture is capable of both mapping tokens to character counts and of addition with a fraction of their current parameter counts. It's not all that hard.
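A toy sketch of the two pieces being argued about, in plain Python (the token split below is hypothetical, not any real tokenizer's output): a finite per-token letter-count table, plus addition across the prompt's tokens.

```python
from collections import Counter

# (1) A finite mapping from token -> letter histogram. Since the vocabulary
#     is finite, this table is something the weights could in principle memorize.
# (2) Addition of the per-token counts across the prompt.
# The token split is hypothetical; it is not any real tokenizer's output.
token_split = ["correct", " horse", " battery", " staple"]

letter_table = {tok: Counter(tok.lower()) for tok in token_split}  # piece (1)
count_t = sum(hist["t"] for hist in letter_table.values())         # piece (2)

print(count_t)  # -> 4
```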
> They just haven't bothered.
Or they don't see the benefit. I'm sure they could train the representation of every token and make spelling perfect. But if you have real users spending money on useful tasks already, how much money would you spend on training answers to meme questions that nobody will pay for? They did it once for the fun headline already, and apparently it's not worth repeating.
That's just a potential explanation for why they haven't bothered. I don't think we're disagreeing.
No, it's the entire architecture of the model. There's no real reasoning. It seems that reasoning is just a feedback loop on top of existing autocompletion.
It's really disingenuous for the industry to call warming tokens for output, "reasoning," as if some autocomplete before more autocomplete is all we needed to solve the issue of consciousness.
Edit: Letter frequency apparently has just become another scripted output, like doing arithmetic. LLMs don't have the ability to do this sort of work inherently, so they're trained to offload the task.
Edit: This comment appears to be wildly upvoted and downvoted. If you have anything to add besides reactionary voting, please contribute to the discussion.
In ten years time an LLM lawyer will lose a legal case for someone who can no longer afford a real lawyer because there are so few left. And it'll be because the layers of bodges in the model caused it to go crazy, insult the judge and threaten to burn down the courthouse.
There will be a series of analytical articles in the mainstream press, and the tech industry will write it off as a known problem with tokenisation that they can't fix because nobody really writes code anymore.
The LLM megacorp will just add a disclaimer: the software should not be used in legal actions concerning fruit companies and they disclaim all losses.
I glumly predict LLMs will end up a bit like asbestos: Powerful in some circumstances, but over/mis-used, hurting people in a way that will be difficult to fix later.
> Edit: Letter frequency apparently has just become another scripted output, like doing arithmetic. LLMs don't have the ability to do this sort of work inherently, so they're trained to offload the task.
Mechanistic research at the leading labs has shown that LLMs actually do math in token form up to a certain scale of difficulty.
> This is a real-time, unedited research walkthrough investigating how GPT-J (a 6 billion parameter LLM) can do addition.
https://youtu.be/OI1we2bUseI
Please define “real reasoning”? Where is the distinction coming from?
Can we not downvote this, please? It's a good question.
There's prior art for formal logic and knowledge representation systems dating back several decades, but transformers don't use those designs. A transformer is more like a search algorithm by comparison, not a logic one.
That's one issue, but the other is that reasoning comes from logic, and the act of reasoning is considered a qualifier of consciousness. But various definitions of consciousness require awareness, which large language models are not capable of.
Their window of awareness, if you can call it that, begins and ends with processing tokens and outputting them. As if a conscious thing could be conscious for moments, then dormant again.
That is to say, conscious reasoning comes from awareness. But in tech, severing the humanities here would allow one to suggest that one, or a thing, can reason without consciousness.
In my personal opinion it is reasonable to define "reasoning" as requiring sentience.
Athenian wisdom suggests that fallacious thought is "unreasonable". So reason is the opposite of that.
I had a fun experience recently. I asked one of my daughters how many r's there are in strawberry. Her answer? Two ...
Of course then you ask her to write it and of course things get fixed. But strange.
I think that's supposed to be the idea of reasoning functionality, but in practice it just seems to allow responses to continue longer than they would have otherwise, by bisecting the output into a token-warming pass and then using what we might consider cached tokens to assist with further contextual lookups.
That is to say, you can obtain the same process by talking to "non-reasoning" models.
To be honest, if a kid asked me how many r's in strawberry, I would assume they were asking how many r's at the end and say 2.
I hate to break it to you but I think your child might actually have gotten swapped in the hospital with an LLM.
> There's no real reasoning. It seems that reasoning is just a feedback loop on top of existing autocompletion.
I like to say that if regular LLM "chats" are actually movie scripts being incrementally built and selectively acted-out, then "reasoning" models are a stereotypical film noir twist, where the protagonist-detective narrates hidden things to himself.
> No, it's the entire architecture of the model.
Wrong, it's an artifact of tokenizing. The model doesn't have access to the individual letters, only to the tokens. Reasoning models can usually do this task well, since they can spell out the word in the reasoning buffer; the fact that GPT-5 fails here is likely a result of the question being answered by a non-reasoning version of the model.
> There's no real reasoning.
This seems like a meaningless statement unless you give a clear definition of "real" reasoning as opposed to other kinds of reasoning that are only apparent.
> It seems that reasoning is just a feedback loop on top of existing autocompletion.
The word "just" is doing a lot of work here - what exactly is your criticism here? The bitter lesson of the past years is that relatively simple architectures that scale with compute work surprisingly well.
> It's really disingenuous for the industry to call warming tokens for output, "reasoning," as if some autocomplete before more autocomplete is all we needed to solve the issue of consciousness.
Reasoning and consciousness are separate concepts. If I showed the output of an LLM 'reasoning' (you can call it something else if you like) to somebody 10 years ago, they would agree without any doubt that reasoning was taking place there. You are free to provide a definition of reasoning which an LLM does not meet, of course, but it is not enough to just say it is so. Using the word autocomplete is rather meaningless name-calling.
> Edit: Letter frequency apparently has just become another scripted output, like doing arithmetic. LLMs don't have the ability to do this sort of work inherently, so they're trained to offload the task.
Not sure why this is bad. The implicit assumption seems to be that an LLM is only valuable if it literally does everything perfectly?
> Edit: This comment appears to be wildly upvoted and downvoted. If you have anything to add besides reactionary voting, please contribute to the discussion.
Probably because of the wild assertions, charged language, and rather superficial descriptions of actual mechanics.
These aren't wild assertions. I'm not using charged language.
> Reasoning and consciousness are separate concepts
No, they're not. In tech we seem to have a culture of severing the humanities for utilitarian purposes, but classical reasoning uses consciousness and awareness as elements of its processing.
It's only meaningless if you don't know what the philosophical or epistemological definitions of reasoning are. Which is to say, you don't know what reasoning is. So you'd think it was a meaningless statement.
Do computers think, or do they compute?
Is that a meaningless question to you? Given your position, I'm sure it's irrelevant and meaningless.
And this sort of thinking is why we have people claiming software can think and reason.
> It's really disingenuous for the industry to call warming tokens for output, "reasoning," as if some autocomplete before more autocomplete is all we needed to solve the issue of consciousness.
There's no obvious connection between reasoning and consciousness. It seems perfectly possible to have a model that can reason without being conscious.
Also, dismissing what these models do as "autocomplete" is extremely disingenuous. At best it implies you're completely unfamiliar with the state of the art; at worst it implies a dishonest agenda.
In terms of functional ability to reason, these models can beat a majority of humans in many scenarios.
Understanding is always functional: we don't study medicine before going to the doctor, we trust the expert. We do the same with almost every topic or system. How do you "understand" a company or a complex technological or biological system? Probably nobody does, end to end. We can only approximate it with abstractions and reasoning. Not even a piece of code can be understood: without executing it we can't tell whether it will halt or not.
It would require you to change the definition of reasoning, or it would require you to believe computers can think.
A locally trained text-based foundation model is indistinguishable from autocompletion and outputs very erratic text, and the further you train its ability to diminish irrelevant tokens, or guide it to produce specifically formatted output, the more you've just moved its ability to curve-fit specific requirements.
So it may be disingenuous to you, but it does behave very much like a curve-fitting search algorithm.
Where in the tokenization does the 3rd b come from?
The tokenisation means they don't see the letters at all. They see something like this, with the words converted to token IDs:
How many 538 do you see in 423, 4144, 9890?
LLMs don’t see token ids, they see token embeddings that map to those ids, and those embeddings are correlated. The hypothetical embeddings of 538, 423, 4144, and 9890 are likely strongly correlated in the process of training the LLM and the downstream LLM should be able to leverage those patterns to solve the question correctly. Even more so since the training process likely has many examples of similar highly correlated embeddings to identify the next similar token.
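A minimal numpy sketch of what that lookup means; the token IDs, vocabulary size, and embedding width are made up for illustration, and the random matrix stands in for the learned one (in a trained model those rows carry the correlations described above):

```python
import numpy as np

# Hypothetical sizes and IDs, for illustration only.
vocab_size, d_model = 50_000, 768
rng = np.random.default_rng(0)
embedding_matrix = rng.standard_normal((vocab_size, d_model))  # learned in a real model

token_ids = [538, 423, 4144, 9890]     # the hypothetical IDs from the comment above
vectors = embedding_matrix[token_ids]  # what the model actually operates on

# "Correlated embeddings" can be made concrete as, e.g., cosine similarity:
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors[0], vectors[1]))  # near 0 here; structured after real training
```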