Comment by varenc
19 hours ago
My guess is because it’s the Smithsonian, they’re just not willing to trust an LLM’s transcription enough to put their name on it. I imagine they’re rather conservative. And maybe some AI-skeptic protectionist sentiments from the professional archivists. Seems like it could change with time though.
> My guess is because it’s the Smithsonian, they’re just not willing to trust an LLM’s transcription enough to put their name on it. I imagine they’re rather conservative
I expect that's a common stance at institutions like that, but I don't think they're framing the problem correctly.
Why not have the LLMs do as much work as possible and have humans review and put their own name on it? Do you think they need to just trust and publish the output of the LLM wholeheartedly?
I think too many people saw what a few idiot lawyers did last year and closed the book on LLM usage.
> Why not have the LLMs do as much work as possible and have humans review and put their own name on it?
That's not a good way to improve on the accuracy of the LLM. Humans reviewing work that is 95% accurate are mostly just going to rubber-stamp whatever you show them. This is equally a problem for humans reviewing the work of other humans.
What you actually want, if you're worried about accuracy, is to do the same work multiple times independently and then compare results.
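As a minimal sketch of that idea: run several independent transcriptions of the same page, then majority-vote token by token and surface only the disagreements for a human to resolve. The `reconcile` function below is a hypothetical illustration, and it assumes the transcripts already tokenize to the same length (a real pipeline would need sequence alignment first).

```python
from collections import Counter

def reconcile(transcripts):
    """Majority-vote across independent transcripts of the same page.

    Assumes all transcripts split into the same number of tokens
    (a real pipeline would align sequences first). Returns the
    consensus text plus the positions where transcripts disagree,
    which are the only spots a human reviewer needs to inspect.
    """
    token_lists = [t.split() for t in transcripts]
    if len({len(toks) for toks in token_lists}) != 1:
        raise ValueError("transcripts must align token-for-token")

    consensus, disputed = [], []
    for i, tokens in enumerate(zip(*token_lists)):
        counts = Counter(tokens)
        word, votes = counts.most_common(1)[0]
        consensus.append(word)
        if votes < len(tokens):  # not unanimous -> flag for review
            disputed.append((i, dict(counts)))
    return " ".join(consensus), disputed
```

The point is that the reviewer's attention goes only to the flagged positions instead of rubber-stamping the whole page.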
The incident with the lawyers just highlighted the fundamental problem with LLMs and AI in general. They can't be trusted for anything serious. Worse, they give the appearance of being correct, which lulls human "checkers" into complacency. Total dumpster fire.
Instead of thinking about this as an all-or-nothing outcome, consider how this might work if they were made accessible with LLMs, and then you used randomized spot checks with experts to create a clear and public error rate. Then, when people see mistakes they can fix them.
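The spot-check idea above can be sketched in a few lines: draw a uniform random sample of pages for expert review, then report the observed error rate with a confidence interval (a Wilson interval here). The function names are illustrative, not from any real project.

```python
import math
import random

def sample_for_review(page_ids, n, seed=0):
    """Draw a uniform random sample of n pages for expert spot checks."""
    rng = random.Random(seed)
    return rng.sample(page_ids, n)

def error_rate_ci(errors_found, pages_checked, z=1.96):
    """Point estimate and ~95% Wilson interval for the per-page error rate.

    The Wilson interval behaves sensibly even when few or zero errors
    are found, unlike the naive normal approximation.
    """
    n = pages_checked
    p = errors_found / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, max(0.0, center - half), min(1.0, center + half)
```

Publishing the resulting interval alongside the transcriptions would give readers exactly the "clear and public error rate" described, and the sampled pages double as a worklist for fixing mistakes.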
I’m trying to do this for old Latin books at the Embassy of the Free Mind in Amsterdam. So many of the books have never been digitized, let alone OCR'd or translated. There is a huge amount of work to be done to make these works accessible.
LLMs won’t make it perfect. But isn’t perfect the enemy of the good? If we make it an ongoing project where the source image material is easily accessible (unlike in a normal published translation, where you just have to trust the translator), then the knowledge and understanding can improve over time.
This approach also has the benefit of training readers not to believe everything they read — but to question it and try to get directly at the source. I think that’s a beautiful outcome.
The article is from The Smithsonian. The actual project is with the National Archives.