Comment by eesmith
18 days ago
I'm trying to understand how to judge the quality of the result, that is, to better quantify what "relatively accurate character identification and relationship mapping" means.
For example, in the example for "The Adventures of Tom Sawyer", I see Bob Tanner connected to Huckleberry Finn but to no one else.
Is that supposed to be significant? I pulled up the source text from https://www.gutenberg.org/cache/epub/74/pg74-images.html :
The first occurrence is an exchange starting:
Tom hailed the romantic outcast:
“Hello, Huckleberry!”
...
“No, I hain’t. But Bob Tanner did.”
The last line is spoken by Huckleberry. It is clear that both kids know who Bob Tanner is, because Tom mentions "he’s the wartiest boy in this town". (The image prompt says "might be holding a bean or have warts on his hands", but the bean is the method that Tom and Huck use.)
The other context is:
> “Well, I have too,” said Tom; “oh, hundreds of times. Once down by the slaughter-house. Don’t you remember, Huck? Bob Tanner was there, and Johnny Miller, and Jeff Thatcher, when I said it. Don’t you remember, Huck, ’bout me saying that?”
So what does it mean that Huck has a connection to Bob but Tom does not, when it seems equally strong in the text?
Or, we see in the graph that "Bull Harbison" is "a dog that howls outside the tannery", which isn't correct. Tom thinks it's Bull, but after another howl they realize it's actually a stray.
Why is Mr. Jones, "the Welshman", referred to as "old man"?
Why is Mrs. Thatcher not listed? Or the Rev. Mr. Sprague, the "Useful Minister" as chapter V's title describes him?
There are also some characters mentioned only once, like "Mr. Benton, an actual United States Senator" and "Major and Mrs. Ward; lawyer Riverson", who are not on the graph, while names like Benny Taylor ("Benny Taylor’s little wagon") and Jimmy Hodges ("he more than half envied Jimmy Hodges, so lately released") are in the graph. Why?
And there's "the cat" just hanging about with no connection.
Also, there should be no link from Jimmy to Huck, as it's not in the text, and we don't know if Benny or Jimmy are young boys, as characterized in the bio.
The graph shows a connection between Uncle Jake and Jim ("Friends") which doesn't exist in the book, where Jake is mentioned two times in a single paragraph. Is there a built-in assumption in the model that the two named Black characters in the book, both slaves, would be friends?
It says that Mary and Aunt Polly are niece/aunt respectively, but I don't see that in the text. We know that Mary is Tom's cousin, and Polly is Tom's aunt, but we don't know the relationship between Mary and Polly. Could they be mother/daughter? https://en.wikipedia.org/wiki/List_of_Tom_Sawyer_characters#... says it's never specified.
It seems like it would be a lot of work to verify both that the generated data is correct and that nothing is missing. Does it really save time and effort?
To be sure, these are small parts of the books, but then again, Tom Sawyer is one of the most analyzed books in the American canon, where there should be a lot of examples in the corpus describing relations between the main characters.
Great observations! Thanks for your deep dive into the results. I didn't go into this level of detail myself, but one thing I noticed is that "the cat" in the graph is actually Peter, the cat that Tom gave painkiller to (with missing connections to Tom and Aunt Polly).
You're absolutely right that some characters are missing even in those short books, and there are likely many more relationships that haven't been fully captured. That said, I’m still quite impressed by how much data the LLM extracted in a single pass, especially given the complexity of the task, the size of the input, and the strict output format.
My estimate of quality was subjective. To truly quantify accuracy, we’d need to establish a "ground truth" with a better approach and measure the difference between the generated and actual relationship graphs. One possible way to do that would be to process the text in multiple passes: first extracting characters, then identifying relationships, both steps with more sophisticated prompt engineering. Another way is to manually annotate the network. The only book I found with a publicly available, human-annotated character network is Les Misérables, based on Donald Knuth’s work: https://github.com/MADStudioNU/lesmiserables-character-netwo...
However, there is an additional challenge. Even with human annotation, the question remains: how do we define a relationship network? What is a relationship in a book? Should it be limited to connections explicitly stated in the text, or can it also include relationships deduced from context with some probability? Defining these criteria is crucial to quantifying the quality of the result.
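As a rough illustration of what "measure the difference between the generated and actual relationship graphs" could look like, here is a minimal sketch that treats each graph as a set of undirected character pairs and reports precision, recall, and F1. It is not the method used to build the graphs discussed here. The function names and the tiny "ground truth" list are hypothetical, made up only to show the arithmetic, and the sketch assumes character names have already been normalized to one canonical form (no alias resolution).

```python
def normalize_edge(a, b):
    """Treat a relationship as undirected: (Tom, Huck) equals (Huck, Tom)."""
    return frozenset((a.strip().lower(), b.strip().lower()))


def compare_graphs(generated, ground_truth):
    """Compare two edge lists and return precision, recall, F1, and the differing edges."""
    gen = {normalize_edge(a, b) for a, b in generated}
    truth = {normalize_edge(a, b) for a, b in ground_truth}
    true_pos = gen & truth

    precision = len(true_pos) / len(gen) if gen else 0.0
    recall = len(true_pos) / len(truth) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    def as_sorted_pairs(edges):
        # Convert frozensets back to sorted tuples so the output is stable.
        return sorted(tuple(sorted(e)) for e in edges)

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "spurious": as_sorted_pairs(gen - truth),  # in generated graph only
        "missing": as_sorted_pairs(truth - gen),   # in ground truth only
    }


# Hypothetical data: edges an LLM might emit vs. a hand-made annotation.
generated = [
    ("Huckleberry Finn", "Bob Tanner"),
    ("Uncle Jake", "Jim"),
    ("Tom Sawyer", "Huckleberry Finn"),
]
ground_truth = [
    ("Tom Sawyer", "Huckleberry Finn"),
    ("Huckleberry Finn", "Bob Tanner"),
    ("Tom Sawyer", "Bob Tanner"),
    ("Tom Sawyer", "Aunt Polly"),
]

result = compare_graphs(generated, ground_truth)
print(result)
```

The score depends entirely on how the ground truth defines a relationship: an annotation limited to explicitly stated connections and one that also admits deduced connections would give different numbers for the same generated graph.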
I still think you should try a book which has been much less studied than the ones you mentioned. The LLM is almost certainly trained on Wikipedia, which has a lot of this information, plus a lot of essays for high school level assignments.
I found 'Annotating Characters in Literary Corpora: A Scheme, the CHARLES Tool, and an Annotated Novel' at https://aclanthology.org/L16-1028/ which describes some manual annotation efforts for Pride and Prejudice. I don't know if the result is available, but the text suggests it is.
It points out a fun observation: "characters maybe referred to by multiple names, sometimes drastically different (e.g. Dr. Jeykll and Mr. Hyde)"
Huh. https://aclanthology.org/2022.latechclfl-1.10.pdf says "that the character networks of translations differ from originals in case of long novels, and the differences may also vary depending on the novel and translator’s strategy."
Ooo, it cites https://theseaofbooks.com/2016/04/29/the-5-least-important-c... which is about the 5 least important characters in Pride and Prejudice:
> So if you filled out our reader survey and are fairly sure you didn’t come across 117 people in Pride and Prejudice last time you read it, this is because when we compiled that list, we added every last entity that could possibly be considered a character. In fact, Pride and Prejudice has a small, cast of characters, compared to certain of our other novels. Ever wanted to know the population of Middlemarch, for example? By our reckoning, it’s the tidy figure of 333! (Admittedly, some of them are goats.)
This might be useful: "Using Citizen Science to study literary social networks" at https://txtlab.org/2024/12/using-citizen-science-to-study-li...
> By mobilizing volunteers to annotate character interactions, we gathered a high-quality dataset of 13,395 labeled interactions from contemporary fiction and non-fiction books. This dataset forms the foundation for understanding how genres and audience factors influence the social structures in narratives.
This appears to be an interesting field, which I have no time to explore any further. :(