Comment by ck_one

3 hours ago

Just tested the new Opus 4.6 (1M context) on a fun needle-in-a-haystack challenge: finding every spell in all Harry Potter books.

All 7 books come to ~1.75M tokens, so they don't quite fit yet. (At this rate of progress, mid-April should do it.) For now you can fit the first 4 books (~733K tokens).

Results: Opus 4.6 found 49 out of 50 officially documented spells across those 4 books. The only miss was "Slugulus Eructo" (a vomiting spell).

Freaking impressive!

42 comments


golfer 2 hours ago

There's lots of websites that list the spells. It's well documented. Could Claude simply be regurgitating knowledge from the web? Example:

https://harrypotter.fandom.com/wiki/List_of_spells

ck_one 2 hours ago
It didn't use web search, but it certainly has some internal knowledge already. It's not a perfect needle-in-a-haystack problem, but Gemini Flash was much worse when I last tested it.
- viraptor 2 hours ago
  
  If you want to really test this, search/replace the names with your own random ones and see if it lists those.
  Otherwise, LLMs have most of the books memorised anyway: https://arstechnica.com/features/2025/06/study-metas-llama-3...
  
- joshmlewis 2 hours ago
  
  I think the OP was implying that it's probably already baked into its training data. No need to search the web for that.
  
- soulofmischief 1 hour ago
  
  The only worthwhile version of this test involves previously unseen data that could not have been in the training set. Otherwise the results could be inaccurate to the point of being harmful.
- eek2121 2 hours ago
  
  Honestly? My advice would be to cook something custom up! You don't need to do all the text yourself. Maybe have AI spew out a bunch of text, or take obscure existing text and insert hidden phrases here or there.
  Shoot, I'd even go so far as to write a script that takes in a bunch of text, reorganizes sentences, and outputs them in a random order with the secrets. Kind of like a "Where's Waldo?", but for text.
  Just a few casual thoughts.
  I'm actually thinking about coming up with some interesting coding exercises that I can run across all models. I know we already have benchmarks, however some of the recent work I've done has really shown huge weak points in every model I've run them on.
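A rough sketch of the shuffle-and-hide script described above, in Python. Everything here is an assumption for illustration: the function name `build_haystack`, the naive sentence splitter, and the sample filler and secrets are made up, not taken from anyone's actual test.

```python
import random
import re


def build_haystack(filler_text, secrets, seed=0):
    """Shuffle the filler sentences and hide each secret at a random position."""
    rng = random.Random(seed)  # seeded, so the same haystack can be rebuilt later
    # Naive split: a sentence ends at ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", filler_text)
    sentences = [p.strip() for p in parts if p.strip()]
    rng.shuffle(sentences)
    for secret in secrets:
        # randint is inclusive on both ends, so a secret can also land last.
        sentences.insert(rng.randint(0, len(sentences)), secret)
    return " ".join(sentences)


# Hypothetical sample data; a real run would use far more filler text.
filler = "The rain fell all night. Nobody noticed the cat. Tea was served at four. The train left early."
secrets = ["The password is marmalade-7.", "Waldo hides behind the blue door."]
haystack = build_haystack(filler, secrets, seed=42)
print(haystack)
```

Keeping the `secrets` list around gives you the answer key, and because the secrets are invented they cannot have been in any training set.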
  

xiomrze 3 hours ago

Honest question, how do you know if it's pulling from context vs from memory?

If I use Opus 4.6 with Extended Thinking (Web Search disabled, no books attached), it answers with 130 spells.

petercooper 3 hours ago
One possible trick could be to search and replace them all with nonsense alternatives then see if it extracts those.
- andai 2 hours ago
  
  That might actually boost performance since attention pays attention to stuff that stands out. If I make a typo, the models often hyperfixate on it.
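The search-and-replace idea could look roughly like this in Python. This is only a sketch under stated assumptions: `rename_needles` and `make_nonsense` are hypothetical helpers, the two spell names are sample input, and matching is exact and case-sensitive.

```python
import random
import re
import string


def make_nonsense(rng, length=9):
    """Return a random capitalized word that is very unlikely to be a real name."""
    letters = (rng.choice(string.ascii_lowercase) for _ in range(length))
    return "".join(letters).capitalize()


def rename_needles(text, names, seed=0):
    """Replace every known name with a nonsense alias; return new text and the mapping."""
    if not names:
        return text, {}
    rng = random.Random(seed)
    mapping = {}
    for name in names:
        alias = make_nonsense(rng)
        while alias in mapping.values():  # keep aliases unique
            alias = make_nonsense(rng)
        mapping[name] = alias
    # Longest names first, so a short name never clobbers part of a longer one.
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(n) for n in ordered))
    return pattern.sub(lambda m: mapping[m.group(0)], text), mapping


sample = '"Expelliarmus!" shouted Harry. Hermione whispered "Lumos".'
renamed, mapping = rename_needles(sample, ["Expelliarmus", "Lumos"], seed=1)
print(renamed)
print(mapping)
```

If the model then reports the aliases, it read them from the context; if it reports the original names, it is answering from memory.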
ck_one 2 hours ago

When I tried it without web search, so with only internal knowledge, it missed ~15 spells.
ozim 2 hours ago
Exactly. There was a study where they tried to make an LLM reproduce an HP book word for word, e.g. by giving it the first sentences and letting it cook.
Basically, with some tricks they managed to get 99% word for word. The tricks were needed to bypass the security measures that are in place precisely to stop people from retrieving training material.
- pron 2 hours ago
  
  This reminds me of https://en.wikipedia.org/wiki/Pierre_Menard,_Author_of_the_Q... :
  > Borges's "review" describes Menard's efforts to go beyond a mere "translation" of Don Quixote by immersing himself so thoroughly in the work as to be able to actually "re-create" it, line for line, in the original 17th-century Spanish. Thus, Pierre Menard is often used to raise questions and discussion about the nature of authorship, appropriation, and interpretation.
- ck_one 2 hours ago
  
  Do you remember how to get around those tricks?
  
clanker_fluffer 3 hours ago

What was your prompt?

meroes 3 hours ago

What is this supposed to show exactly? Those books have been fed into LLMs for years, and there's likely even specific RLHF on extracting spells from HP.

muzani 2 hours ago

There was a time when I put the Ea-nasir text into base64 and asked AI to convert it. Remarkably, it identified the correct text but pulled the most popular translation of the text rather than the one I gave it.
rvz 2 hours ago

> What is this supposed to show exactly?
Nothing.
You can be sure that this was already in the training data of PDFs, books and websites that Anthropic scraped to train Claude; hence 'documented'. This is why tests like the one the OP just did are meaningless.
Such "benchmarks" are performative for VCs, who do not ask why the research and testing isn't done independently instead of almost always by in-house researchers.

kybernetikos 1 hour ago

I recently got Junie to code me up an MCP for accessing my calibre library. https://www.npmjs.com/package/access-calibre

My standard test for that was "Who ends up with Bilbo's buttons?"

dom96 44 minutes ago

I often wonder how much of the Harry Potter books were used in the training. How long before some LLM is able to regurgitate full HP books without access to the internet?

dwa3592 1 hour ago

Have another LLM (Gemini, ChatGPT) make up 50 new spells. Insert those, test, and maybe report here :)
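Once made-up spells are inserted, scoring the model's answer is a small set comparison. A minimal Python sketch, assuming the model's reply has already been parsed into a list of names; `score_recall` and the spell names below are invented for illustration.

```python
def score_recall(expected, reported):
    """Return (recall, missed) for the inserted names versus the model's list."""
    def normalize(name):
        # Ignore case and extra whitespace when comparing names.
        return " ".join(name.lower().split())

    reported_set = {normalize(r) for r in reported}
    missed = [e for e in expected if normalize(e) not in reported_set]
    found = len(expected) - len(missed)
    recall = found / len(expected) if expected else 0.0
    return recall, missed


# Hypothetical inserted spells and a hypothetical model answer.
inserted = ["Flibberwump", "Quazzle Dorn", "Mirthex"]
model_answer = ["flibberwump", "Mirthex", "Lumos"]
recall, missed = score_recall(inserted, model_answer)
print(f"recall={recall:.2f}, missed={missed}")
```

By the same measure, the 49 out of 50 in the original post is a recall of 0.98. Names the model reports that were never inserted (like "Lumos" here) are ignored by this sketch, so it measures recall only, not precision.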

muzani 2 hours ago

There's a benchmark that works similarly but asks harder questions, also based on books: https://fiction.live/stories/Fiction-liveBench-Feb-21-2025/o...

I guess they have to add more questions as these context windows get bigger.

zamadatix 3 hours ago

To be fair, I don't think "Slugulus Eructo" (the name) is actually in the books. This is what's in my copy:

> The smug look on Malfoy’s face flickered.

> “No one asked your opinion, you filthy little Mudblood,” he spat.

> Harry knew at once that Malfoy had said something really bad because there was an instant uproar at his words. Flint had to dive in front of Malfoy to stop Fred and George jumping on him, Alicia shrieked, “How dare you!”, and Ron plunged his hand into his robes, pulled out his wand, yelling, “You’ll pay for that one, Malfoy!” and pointed it furiously under Flint’s arm at Malfoy’s face.

> A loud bang echoed around the stadium and a jet of green light shot out of the wrong end of Ron’s wand, hitting him in the stomach and sending him reeling backward onto the grass.

> “Ron! Ron! Are you all right?” squealed Hermione.

> Ron opened his mouth to speak, but no words came out. Instead he gave an almighty belch and several slugs dribbled out of his mouth onto his lap.

sobjornstad 1 hour ago

I have a vague recollection that it might come up named as such in Half-Blood Prince, written in Snape's old potions textbook?
In support of that hypothesis, the Fandom site lists it as “mentioned” in Half-Blood Prince, but it says nothing else and I'm traveling and don't have a copy to check, so not sure.
ck_one 2 hours ago

Then it's fair that it didn't find it.

bartman 2 hours ago

Have you by any chance tried this with GPT 4.1 too (also 1M context)?

LanceJones 2 hours ago

Assuming this experiment involved isolating the LLM from its training set?

irishcoffee 1 hour ago

The top comment is about finding bastardized Latin words from children's books. The future is here.

Geste 1 hour ago

I'll have some of that coffee too. This is quite a sad time we're living in, where this counts as a proper use of our limited resources.

TheRealPomax 1 hour ago

That doesn't seem a super useful test for a model that's optimized for programming?

guluarte 3 hours ago

You can get the same result just by asking Opus/GPT; it is probably internalized knowledge from Reddit or similar sites.

ck_one 2 hours ago

If you just ask it you don't get the same result. Around 13 spells were missing when I just prompted Opus 4.6 without the books as context.
