Comment by mmooss

5 hours ago

I hate to say it, but might LLMs transform archival work? Not by replacing researchers, but by ingesting everything (or orders of magnitude more than we could previously) and giving the researcher a prioritized list of documents, etc., to examine?


lazyasciiart 3 hours ago

The bottleneck is physical work, as I understand it. And primarily delicate physical work that does not destroy the already disintegrating materials that are piled up in boxes for miles.

https://www.aaa.si.edu/documentation/digitizing-entire-colle...

jfengel 4 hours ago

If you could automate transcription, it would be an enormous boon to researchers.

Reading the handwriting would be really hard, and it would be a massive effort to move all that paper. Just handling it is hard; it's not like flipping through mass-manufactured books.

But I suspect that you could spend a few million dollars to revolutionize the field.

order-matters 3 hours ago
>automate transcription
This also means trusting the LLM to decide what things mean. But there is very likely a great middle ground: have LLMs take their best guesses, then verify the output on significant finds. The risk is the LLM understating something important (false negatives), which puts material that appears mundane but isn't at the bottom of the pile.
- mmooss 31 minutes ago
  
  That's why I suggest the output would be a prioritized list of documents for the researchers to review; the LLM doesn't get the final say, it just makes recommendations. Yes, things would be missed, but the researchers might in theory find much more value than with their current search method.
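
To make the "prioritized list plus human review" idea concrete, here is a minimal sketch, not anyone's actual pipeline. It ranks documents by a relevance score and also pulls a small random sample from the low-ranked remainder for a human to spot-check, as one possible guard against the false negatives mentioned above. Everything named here is an assumption for illustration: `triage`, `keyword_score`, the keywords, and the sample documents are hypothetical, and `keyword_score` is only a stand-in for whatever score a real model would return.

```python
import random
from typing import Callable, List, Tuple


def triage(
    docs: List[str],
    score: Callable[[str], float],
    top_k: int,
    audit_n: int,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Return (priority list, audit sample).

    priority: the top_k documents by score, highest first.
    audit:    up to audit_n documents drawn at random from the rest,
              so a human can check what the scorer ranked low.
    """
    ranked = sorted(docs, key=score, reverse=True)
    priority = ranked[:top_k]
    rest = ranked[top_k:]
    rng = random.Random(seed)  # seeded so the sample is reproducible
    audit = rng.sample(rest, min(audit_n, len(rest)))
    return priority, audit


def keyword_score(text: str) -> float:
    """Hypothetical stand-in for a model's relevance score."""
    keywords = ("treaty", "deed", "letter")
    lowered = text.lower()
    return float(sum(lowered.count(k) for k in keywords))


docs = [
    "Grocery list",
    "Deed of sale, with a letter attached",
    "Treaty draft",
    "Weather notes",
]
priority, audit = triage(docs, keyword_score, top_k=2, audit_n=1)
print("Review first:", priority)
print("Spot-check:", audit)
```

The audit sample doesn't fix false negatives, but if reviewers keep finding interesting material in it, that is a signal the ranking can't be trusted yet.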
kevin_thibedeau 2 hours ago

This is already the case with genealogical sites that have ML OCR creating searchable indices of handwritten documents.

garlic_enjoyer 3 hours ago

Assuming they have been transcribed, yes. The key idea that makes LLMs special is the attention mechanism. Maintaining attention over volumes of data is boring for most humans.

Also, to be pedantic, just talking about LLMs in this context is a tad reductive. There are many deep learning models involved in archival work that aren't language models.

I encourage you to read into this post for more context on what I mean: https://news.ycombinator.com/item?id=48675179

Digory 3 hours ago

I had ChatGPT translate some old, handwritten French legal documents for family history purposes. It was far more accurate than I expected.

At scale, with better models, we might have a way to clear out the old archives. Not only could you translate, you could ask it to triage the discoveries. "Would the average person find this noteworthy?"

contingencies 1 hour ago
I have a ton of handwritten German stuff from the 19th century. My grandmother could make a fair stab at it, but nobody left can read it. I've shown it to modern Germans and they are at a loss. Thanks for your idea, I will give it a look. Any tips on model/method/training?
- CamperBob2 15 minutes ago
  
  Try both Gemini Pro and ChatGPT. They are both outstanding at reading almost-unreadable documents. Use the highest thinking level your account supports.
  (If you want to post a sample or two here, I'll try it. I like to collect difficult out-of-distribution test materials.)

computerdork 4 hours ago

Oh, wow, that is actually an interesting application of AI