Comment by vidarh
5 days ago
You'd think so. It seems like there are a lot of odd gaps like that.
I also have a favourite English-language PhD thesis that I ask every new model about. They still struggle to find it, even though there's a Wikipedia article about it that links to a blog post I wrote about it.
Anyone who thinks they've exhausted even publicly crawlable resources should ask them about some obscure stuff.
You might be surprised if you take this approach: give key words and phrases in small amounts, with each sentence of the prompt building on the previous one. Take an example that is not very hard, like the original text of Lewis Carroll's Alice in Wonderland. A quick question might get things sort of wrong or miss details, but if you guide the LLM to a certain part of the story, then to a certain set of characters in that part, then to a certain statement or dramatic moment with those characters, you might get very specific detail that is close to line-by-line accurate. On the other hand, if you ask a quick, ordinary question about the same part of the story without supplying context and character names, you get an equally vague answer. YMMV
For the PhD thesis in question, I've actually tested a lot of requests about different parts of it, and both Claude and ChatGPT still draw a total blank if you don't let them do searches.
The models don't retain their full training data set.
No, but they do retain enough that it is interesting what they fail to retain.