Comment by demosthanos

20 hours ago

Before commenting asking about why they don't just use LLMs, please note that the article specifically calls out that they do, but it's not always a viable solution:

> The agency uses artificial intelligence and a technology known as optical character recognition to extract text from historical documents. But these methods don’t always work, and they aren’t always accurate.

The document at the top is likely an especially easy document to read precisely because it's meant to be the hook to get people to sign up and get started. It isn't going to be representative of the full breadth of documents that the National Archives want people to go through.

Determining whether the latest off-the-shelf LLMs are good enough should be straightforward because of this:

“Some participants have dedicated years of their lives to the program—like Alex Smith, a retiree from Pennsylvania. Over nine years, he transcribed more than 100,000 documents”

Have different LLMs transcribe those same documents and compare the results to see whether the human or the machine is more accurate, and by how much.
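A minimal sketch of how such a comparison could be scored (hypothetical strings, and assuming a vetted ground-truth transcription exists to score against) is a character error rate (CER) check:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    # Character error rate: edits needed / reference length.
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

# Score a (made-up) model transcription against a vetted reference.
ref = "the said John Hopper is in reduced and indigent circumstances"
hyp = "the said John Hooper is in reduced and indigent circumstances"
print(round(cer(ref, hyp), 3))
```

Running the same metric over the human's transcriptions and each model's output for the same pages would give the head-to-head numbers directly.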

  • This is not an LLM problem. It was solved years ago via OCR. Worldwide, postal services long ago deployed OCR to read handwritten addresses. And there was an entire industry of OCR-based data entry services, much of it transcribing the chicken scratch of doctors' handwriting on medical forms, long before LLMs were a thing.

    • LLMs improve significantly on state-of-the-art OCR. LLMs can do contextual analysis. If I were transcribing these by hand, I would probably feed them through OCR plus an LLM, then ask an LLM to compare my transcription to its transcription and comment on any discrepancies. I wouldn't be surprised if I offered minimal improvement over just having the LLM do it, though.
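The "compare my transcription to its transcription" step doesn't even need a second model call; a plain word-level diff surfaces the disputed spans for review. A minimal sketch (hypothetical strings) using Python's standard difflib:

```python
import difflib

def discrepancies(mine: str, theirs: str):
    """Return word-level differences between two transcriptions,
    so a reviewer only has to look at the disputed spans."""
    a, b = mine.split(), theirs.split()
    sm = difflib.SequenceMatcher(None, a, b)
    return [(op, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
            for op, i1, i2, j1, j2 in sm.get_opcodes()
            if op != "equal"]

mine = "he verily believes the facts as stated are true"
model = "he verily believes the fact as stated are true"
print(discrepancies(mine, model))  # → [('replace', 'facts', 'fact')]
```

An LLM would still be useful for commenting on *why* a span is disputed, but the diff keeps the comparison itself deterministic.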


    • For the addresses it might be a bit easier because they are a lot more structured and, in theory, the vocabulary is a lot more limited. I'm less sure about medical notes, although I'd suspect there are fairly common things they are likely to say.

      Looking at the (admittedly single) example from the National Archives, it seems a bit more open-ended than perhaps the other two examples. It's not impossible that LLMs could help with this.

    • Yes, but there was usually a fall-back mechanism where an unrecognized address would be shown on a screen to an employee who would type it so that it could then be inkjetted with a barcode.

    • Fun fact: convolutional neural networks developed by Yann LeCun were instrumental in that rollout!

OK, fair enough, but can you find one in this article that's hard for an LLM? The gnarliest one I saw, 4o handled instantly, and I went back and looked carefully at the image and the text and I'm sold.

Like if this is a crowdsourcing project, why not do a first pass with an LLM and present users with both the image and the best-effort LLM pass?

Later

I signed up, went to the current missions, and they all seem to be post-1900 and all typeset. They're blurry, but 4o cuts through them like a hot knife through butter.

  • My parents have saved letters from their parents which are written in cursive but in two perpendicular layers. Meaning the writing goes horizontally in rows and then when they got to the end of the page it was turned 90 degrees and continued right on top of what was already there for the whole page. This was apparently to save paper and postage. It looks like an unintelligible jumble but my mother can actually decipher it. Maybe that’s what the LLMs are having trouble with?

    Edit: apparently it’s called cross writing [1]

    1: https://highshrink.com/2018/01/02/criss-cross-letters/

    • Are they having trouble? You can sign up right now and get tasks from the archive that seem trivial for 4o (by which I mean: feed a screenshot to 4o, get a transcription, and spot check it).

  • Did you actually check it? Sonnet 3.5 generates text that seems legitimate and generally correct, but misreads important details. LLMs are particularly deceptive because they will be internally consistent - they'll reuse the same incorrect name in both places and will hallucinate information that seems legit, but in fact is just made-up.

    • Just keep version control, and run randomized spot checks with experts so you have a known error rate.
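The spot-check idea above can be made concrete with a random audit of accepted transcriptions. A sketch with made-up data (not the Archives' actual process), using a normal-approximation confidence interval:

```python
import math
import random

def spot_check(transcripts, expert_ok, n=200, seed=0):
    """Sample n transcripts for expert review and return the estimated
    error rate with a 95% confidence interval (normal approximation)."""
    rng = random.Random(seed)
    sample = rng.sample(transcripts, min(n, len(transcripts)))
    errors = sum(1 for t in sample if not expert_ok(t))
    p = errors / len(sample)
    half = 1.96 * math.sqrt(p * (1 - p) / len(sample))
    return p, (max(0.0, p - half), min(1.0, p + half))

# Toy data: pretend 1 in 20 machine transcriptions has an error.
docs = [{"id": i, "ok": i % 20 != 0} for i in range(2000)]
rate, (low, high) = spot_check(docs, lambda d: d["ok"])
print(f"estimated error rate {rate:.1%}, 95% CI [{low:.1%}, {high:.1%}]")
```

With a few hundred expert-reviewed samples, the archive could publish an error rate for the machine-assisted pipeline and compare it against the known error rate of purely human transcription.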

    • You don't use an LLM, but other transformer-based OCR models like TrOCR, which have very low CER and WER (character and word error rates).

  • > Like if this is a crowdsourcing project, why not do a first pass with an LLM and present users with both the image and the best-effort LLM pass?

    Possibly for the reason that came up in your other post: you mentioned that you spot checked the result.

    Back when I was in historical research, and occasionally involved in transcription projects, the standard was 2-3 independent transcriptions per document.

    Maybe the National Archive will pass documents to an LLM and use the output as 1 of their 2-3 transcriptions. It could reduce how many duplicate transcriptions are done by humans. But I'll be surprised if they jump to accepting spot checked LLM output anytime soon.

    • You get that I'm not saying they should just commit LLM outputs as transcriptions, right?

  • My guess is because it’s the Smithsonian, they’re just not willing to trust an LLM’s transcription enough to put their name on it. I imagine they’re rather conservative. And maybe some AI-skeptic protectionist sentiments from the professional archivists. Seems like it could change with time though.

    • > My guess is because it’s the Smithsonian, they’re just not willing to trust an LLM’s transcription enough to put their name on it. I imagine they’re rather conservative

      I expect that's a common theme from institutions like that, yet I don't think they understand the issue they think they have there.

      Why not have the LLMs do as much work as possible and have humans review and put their own name on it? Do you think they need to just trust and publish the output of the LLM wholeheartedly?

      I think too many people saw what a few idiot lawyers did last year and closed the book on LLM usage.


    • The article is from The Smithsonian. The actual project is with the National Archives.

  • I don't know about this project, but I can easily find thousands of images that GPT-4o can't read but a human expert can. It does typed text excellently, antiqua-style cursive if it's very neat, and Kurrent-style cursive never.

    • For straightforward reasons, I am commenting on this project, not the space of all possible projects. I did try, once, to get 4o to decode the Zodiac Killer's message. It didn't work.

  • I'm doing some genealogy work right now on my family's old papers covering the time period from recent years back to the late 17th century. Handwriting styles changed a lot over the centuries and individuals can definitely be identified by their personal cursive style of writing and you can see their handwriting change as they aged.

    Then you have the problem that some of these ancestors not only had terrible penmanship but also spelled multi-syllabic words phonetically since they likely were barely educated kids who spent more time when they were young working on the farm or ranch instead of attending school where they would've learned how to spell correctly.

    I don't know whether your LLM can handle English words spelled phonetically written in cursive by an individual who had no consistency in forming letters in the words. It is clear after reading a lot of correspondence from this person that they ignored things that didn't seem important in the moment like dotting i's or crossing t's or forming tails on g's, p's, j's, or even beginning letters consistently since they switched between cursive and block letters within a sentence, maybe while they paused to clarify their thoughts. I don't know but it is fascinating to take a walk through life with someone you'll never meet and to discover that many of the things that seemed awesome to you as a kid were also awesome to them and that their life had so many challenges that our generations will never need to endure.

    Some of my people have the most beautiful flowing cursive handwriting that looks like the cursive that I was taught in grade school. Others have the most beautiful flowing cursive with custom flourishes and adornments that make their handwriting instantly recognizable and easy to read once you understand their style.

    I think there are plenty of edge cases where LLMs will take a drunkard's walk through the scribble and spit out gibberish.

    I'm reminded of an old joke though.

    Ronald Reagan woke up one snowy Washington, DC morning and took a look out of the window to admire the new-fallen snow. He enjoyed the beautiful scene laid out before him until he saw tracks in the snow below his window and a message, obviously written in piss, that said: "Reagan sucks."

    He dispatched the Secret Service to the site where samples were taken of the affected snow and photos of the tracks of two people were made.

    After an investigation he received a call from the Secret Service agent in charge, who told him he had some good news and some bad news for him.

    The good news is that they know who pissed the message. It was George HW Bush, his Vice President. The bad news is that it was Nancy's handwriting.

  • Real quick, how long do you think chatgpto4 has existed? How long do you think the National Archive has been archiving?

    • It's 4o. The crowdsourced transcription project dates back to 2012. My comment is mostly on this article.

  • > Like if this is a crowdsourcing project...

    I'm confused by what you're asking. Are you asking me to like (upvote) your comment if this is a crowdsourcing project? Don't we already know it is a crowdsourcing project?

  • One that requires additional work beyond simply feeding the image into the model is this example, which is a mix of barely legible handwritten cursive and an easy-to-read typed form. [0] Initially 4o just transcribes (successfully) the bottom half of the text and has to be prompted to attempt the top half, at which point it seems to at best summarize the text instead of giving a direct transcription. [1] In fact it seems to mix up some portions of the latter half of the typed text with the written text in the portion of its "transcription" about "reduced and indigent circumstances".

    [0] https://catalog.archives.gov/id/54921817?objectPage=8&object...

    [1] Reproducing it here, since I can't share the chat because it has user-uploaded images: "The text in the top half of the image is handwritten and partially difficult to read due to its cursive style and some smudging. Here's my best transcription attempt for the top section:

    ...resident within four? years, swears and says that the name of the John Hopper mentioned in the foregoing declaration is the same person, and he verily believes the facts as stated in the declaration are true.

    He further swears that the said John Hopper is in reduced and indigent circumstances and requires the aid of his country.

    The declarant further swears he has no evidence now in his power of service, except the statement of Capt. (illegible name), as to his reduced circumstances ...

    Sworn to before me, this day...

    Some parts remain unclear due to the handwriting, but let me know if you'd like me to attempt further clarification on specific sections!"

    • > this example which is a mix of barely legible hand written cursive and easy to read typed form.

      > In fact it seems to mix up some portions of the latter half of the typed text with the written text in the portion of it's "transcription" about "reduced and indigent circumstances".

      What typed form? What typed text? That image is a single handwritten page, and the writing is quite clean, not "barely legible".† The file related to John Hopper appears to be 59 pages, and some of them are typed, but they're all separate images.

      Are you trying to process all 59 pages at once? Why?

      I should note that transcription is an excellent use of an LLM in the sense of a language model, as opposed to an "LLM" in the sense of several different pieces of software hooked together in cryptic ways. It would be a lot more useful, for this task, to have direct access to the language model backing 4o than to have access to a chatbot prompt that intermediates between you and the model.

      † My biggest problems in reading the page: Cursive n and u are often identical glyphs (both written и), leading me to read "Ind." as "Jud."; and I had trouble with the "roster" at the bottom of the page. What felt weirdest about that was that the crossbar of the "t" is positioned well above the top of the stem, but that can't actually be what tripped me up, because on further review it's a common feature of the author's handwriting that I didn't even notice until I got to the very end of the letter. It's even true in the earlier instance of "Roster" higher up on the page. So my best guess is that the "os" doesn't look right to me.

      I misread 1758 as 1958, too, but hopefully (a) that kind of thing wears off as you get used to reading documents about the Revolutionary War; and (b) it's a red flag when someone who died in 1838 was born in 1958 according to a letter written in 1935.

Something about extraordinary claims and extraordinary evidence? The evidence presented, a seemingly easily transcribed image, is hardly persuasive.

  • Some are significantly harder to read. I took the page below and tried to get GPT-4o to transcribe it, and it basically couldn't do it. I'm not going to sit and prompt-hack for ages to see if it can, but it seems unable to tackle the handwritten text at the top. When I first just fed it the image and asked for a transcription, it only (but successfully) read the bottom portion; prompted for a transcription of the top, it dropped into more of a summary of the whole document, mainly pulling some phrases from the bottom text. (Sadly I can't share the chat, but I copied its reply out in a comment upthread.) [0]

    It was more successful with a few others I tried, but it's still a task that, like a lot of LLM output, requires manual checking for accuracy and prompt modification to get it to output what you need for some documents.

    https://news.ycombinator.com/item?id=42746490