Comment by firasd

1 day ago

When I see Dario say all this I wonder what he's seeing that seems so differentiated from my POV

I wonder how much of Anthropic's success is based on a hoard of prompt-to-code examples that they post-train on. We shouldn't assume Opus' web dev skills are the same as general intelligence

Here's something Sonnet 4 did (Jul 2025) that I don't think a human intern would do when I made a wikisource tool (transcript follows)

#-#-#

F:

let’s see how the wikisource tool wrks.. let’s try to match “Mary hath chosen that good part” in one of the bible sources they should have

Claude: Great idea! Let me first search for some biblical content on WikiSource, then use the content match tool.

wiki_source_search

Perfect! Let me try the Douay-Rheims version of Luke, which should have the Mary and Martha story:

wiki_source_page_content_match

No matches! Let me try a slightly different phrasing – maybe it’s “hath chosen the good part”:

wiki_source_page_content_match

Still nothing. Let me try searching for just “Martha” to see if the story is there:

F:

dummy .. why would use kjv wording in a different edition lol

Claude: Haha you’re absolutely right! I was being completely silly – of course the KJV wording “Mary hath chosen that good part” wouldn’t be in the Douay-Rheims translation. Different translations use different language entirely.

What I Did Wrong: • Used King James Version phrasing in a Catholic translation • The Douay-Rheims says “Mary hath chosen the best part” • I was looking for “good part” (KJV wording)

Should Have Started With: Just searching for “Martha” or “Mary” to see what the actual text says in that translation, then used the exact wording I found.

#-#-#

It's almost comedic.. "Perfect! Let me try the Douay-Rheims version of Luke" is where if you were watching a horror movie you'd say "Don't go down to the basement fridge!"

And I think a human when they first see the text match fail would go "hmm--ohh" but Claude starts doing permutations. This I guess is the side effect of Reinforcement Learning and system prompts that amount to saying: "Just do it. Don't ask questions. Just do it."

I find one-off anecdotal examples like this to be a bit like discourse around global warming - "Look at that ridiculous polar vortex we had this week! Global warming can't possibly be a thing!" Of course, a trend line comprises many points, and not every point falls perfectly in the center of the line! I'm not necessarily saying you are right or wrong, but your argument should address the line (and ideally give some reason why it might falter) rather than just a single point on that line.

  • Ah but I'm not arguing about the rate of change in the trend. I'm saying the signals are decoupled. That is to say an LLM can be as good a programmer as Linus Torvalds without having even the basic knowledge-generalization abilities we assume the median human with no specialized skills would have (when given the same knowledge an LLM has)

    • I think most LLM proponents would say that "basic knowledge-generalization abilities" is on a different, slower trend line.

      I mean, you aren't very surprised that your CPU can crush humans at chess but can barely run an image classifier, right? But you probably wouldn't say (as you are saying with LLMs) that ability for a CPU to play chess is "decoupled" from classifying images. Increases in CPU speed improve both. You'd just say that one is a lot harder than the other.

> When I see Dario say all this I wonder what he's seeing that seems so differentiated from my POV

Billions of dollars

I have no idea what you are even asking Claude to do here.

  • I was asking it to see if the wikisource tools are working by looking up a Bible quote. There was no ambiguity about the task itself; what I'm saying is that Claude 'knows' a bunch of things (the Bible has different translations) that it doesn't operationalize when doing a task--issues that would be glaringly obvious to a human who knows the same things

    • Maybe I'm missing the point as well, but what did it do wrong?

      It seemed like you wanted to see if a search tool was working.

      It looked to see. It tried one search using one data source (KJ) and found no matches. The next question would be: is the quote not in there, is the quote being misremembered, or is there something wrong with the data source? It tries an easier-to-match quote and finds nothing, which it finds odd. So the next step in debugging is to assume the hypothesis that the KJ Bible data source is broken, corrupted or incomplete (or not working for some other reason). So it searches for an easier quote using a different data source.

      The next bit is unclear because it looks like you may have interrupted it, but it seems like it found the passage about Mary in the DR data source. So by elimination, it now knows the tool works (it can find things) and the DR data source works (it can also find things), so back to the last question of eliminating hypotheses: is the quote wrong for the KJ data source, or is that data source broken?

      The next (and maybe last) query I would do, and what it chose, was to search for something guaranteed to be there in the KJ version: the word 'Mary'. Then scan through the results to find the quote you want, then re-query using the exact quote you know is there. You get a few possible outcomes.

      If it can't find "Mary" at all in the KJ dataset then the data source is likely broken. If it finds Mary, but the results don't contain the phrase, then the data source is incomplete. If the results contain the phrase, then search for it; if it doesn't find it then you've narrowed down the issue to "phrase-based search seems to fail". If it does find it, and it's the exact quote it searched for originally, then you know search has an intermittent bug.
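      The outcomes above can be sketched as a small decision procedure. This is only an illustration under stated assumptions: `make_search` is a hypothetical stand-in for the real wikisource tool (a plain substring match over an in-memory list of passages), and `diagnose` and its return labels are made up here to mirror the conclusions in the paragraph above.

```python
from typing import Callable, List

# Hypothetical search interface: phrase in, matching passages out.
Search = Callable[[str], List[str]]


def make_search(passages: List[str]) -> Search:
    """Toy stand-in for the real tool: case-insensitive substring match."""
    def search(phrase: str) -> List[str]:
        return [p for p in passages if phrase.lower() in p.lower()]
    return search


def diagnose(search: Search, keyword: str, wanted: str) -> str:
    """Walk the outcomes described above and name the likely problem."""
    hits = search(keyword)
    if not hits:
        # A word that must be present returns nothing.
        return "source broken"
    if not any(wanted.lower() in h.lower() for h in hits):
        # The keyword is found, but this wording is not in the source.
        return "phrase not in source"
    if not search(wanted):
        # The text is there, yet searching for the phrase finds nothing.
        return "phrase search fails"
    # The phrase is present and searchable, so an earlier miss was transient.
    return "phrase search works"


# Douay-Rheims wording, as quoted in the transcript above.
dr = make_search(["Mary hath chosen the best part"])
print(diagnose(dr, "Mary", "Mary hath chosen that good part"))
# prints "phrase not in source"
```

      In this toy run the second branch fires: the source is fine, it just uses different wording from the phrase being searched for.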

      This seemed like perfect debugging to me - am I missing something here?

      And it even summarized at the end how it could've debugged this process faster. Don't waste a few queries up front trying to pin down the exact quote. Search for "Mary", get a quote that is in there, then search for that quote.

      This seems perfectly on target. It's possible I'm missing something though. What were you looking for it to do?
