Comment by simonw

4 months ago

If you take a look at the system prompt for Claude 3.7 Sonnet on this page you'll see: https://docs.claude.com/en/release-notes/system-prompts#clau...

> If Claude is asked to count words, letters, and characters, it thinks step by step before answering the person. It explicitly counts the words, letters, or characters by assigning a number to each. It only answers the person once it has performed this explicit counting step.

But... if you look at the system prompts on the same page for later models - Claude 4 and upwards - that text is gone.

Which suggests to me that Claude 4 was the first Anthropic model where they didn't feel the need to include that tip in the system prompt.

38 comments

simonw

kristianp 4 months ago

Does that mean they've managed to post train the thinking steps required to get these types of questions correct?

therealpygon 4 months ago
IMO, it’s just a small scale example of “training to the tests” because “count the ‘r’s in strawberry” became such a popular test that would make the news when a powerful model couldn’t answer such a simple question correctly while being advertised as the smartest model ever.
Assigning this as an indicator for improvement of intelligence seems like a mistake (or wishful).
- jononor 4 months ago
  
  If done at scale, they are kinda crowd sourcing the test set from the entire internet, personal and business world. It will be harder and harder at least to pinpoint weaknesses, at least for the general public. It probably has little to do with intelligence (at least fluid intelligence as defined by Chollet et al) - but I guess it is sound tactic if the strategy is "fake it till you make it". And we might be surprised as to how far along that can go...
simonw 4 months ago

That's my best guess, yeah.

ivape 4 months ago

Or they’d rather use that context window space for more useful instructions for a variety of other topics.

astrange 4 months ago
Claude's system prompt is still incredibly long and probably hurting its performance.
https://github.com/asgeirtj/system_prompts_leaks/blob/main/A...
- jazzyjackson 4 months ago
  
  They ain't called guard rails for nothing! There's a whole world "off-road" but the big names are afraid of letting their superintelligence off the leash. A real shame we're letting brand safety get in the way of performance and creativity, but I guess the first New York Times article about a pervert or terrorist chat bot would doom any big name partnerships.
  
  12 replies →

curioussquirrel 4 months ago

Thanks, Simon! I saw the same approach (numbering the individual characters) in GPT 4.1's answer, but not anymore in GPT 5's. It would be an interesting convergence if the models from Anthropic and OpenAI learned to do this at a similar time, especially given they're (reportedly) very different architecturally.

hansmayer 4 months ago

Not trying to be cynical here, but I am genuinely interested is there a reason why these LLM don't/can't/won't apply some deterministic algorithm? I mean, counting characters and such, we have solved those problems ages ago.

simonw 4 months ago
They can. ChatGPT has been able to count characters/words etc flawlessly for a couple of years now if you tell it to "use your Python tool".
- hansmayer 4 months ago
  
  Fair enough. But why do I have to tell them that, should they not be able to figure it out themselves? If I show a 5-year kid once how to use colour pencils, I won't have to show them each time they want to make a drawing. This is the core weakness of the LLMs - you have to micromanage them so much, that it runs counter to the core promise that is being pushed since 3+ years now.
  
  12 replies →
dan-robertson 4 months ago
I think the intuition is that they don’t ‘know’ that they are bad at counting characters and such, so they answer the same way they answer most questions.
- hansmayer 4 months ago
  
  Well, they can be made to use custom tools for writing to files and such, so I am not sure if that is the real reason? I have a feeling it is more because of trying to make this an "everything technology".
  
  1 reply →