Comment by greggsy
3 days ago
It's trivial to normalise the various formats, and there are a few libraries and ML models to help parse PDFs. I was tinkering around with something like this for academic papers in Zotero, and the main issues I ran into were words spilling over to the next page, and footnotes. I totally gave up on that endeavour several years ago, but the tooling has probably matured a lot since then.
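The failure mode I mean is easy to reproduce with naive page-by-page extraction. Rough sketch, assuming PyMuPDF rather than whatever Zotero's importers actually use:

```python
# Naive page-by-page extraction (assumes PyMuPDF, imported as fitz).
# This is roughly where the trouble starts: a word hyphenated across a
# page break comes out split in two, and footnotes land wherever they
# sit in the page's layout order, interleaved with the body text.
import fitz  # PyMuPDF

def extract_naive(path: str) -> str:
    doc = fitz.open(path)
    pages = [page.get_text("text") for page in doc]
    return "\n".join(pages)
```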
As an example, all the academic paper hubs have been using this technology for decades.
I'd wager that all of the big Gen AI companies plan to use this exact dataset, and many of them probably already have.
> It's trivial to normalise the various formats,
Ha. Ha. ha ha ha.
As someone who has tried pretty broadly to normalize a pile of books and documents I have legitimate access to: no, it is not.
You can get good results 80% of the time, usable but messy results 18% of the time, and complete garbage the remaining 2%. More effort seems to only result in marginal improvements.
98% sounds good enough for the use case suggested here.
Writing good validators for data is hard. You can be 100% sure that there will be bad data in that 98%. From my own experience: I thought I had 50% of the books converted correctly, and then I found I still had junk data and gave up. It is not an impossible problem; I just was not motivated to fix it on my own. Working with your own copies is fine, but when you try to share them you get into legal issues that I just do not find that interesting to solve.
Edit: my point is that I would like to share my work but that is hard to do in a legal way. That is the main reason I gave up.
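To make the "good validators are hard" point concrete, the kind of check I mean looks something like this. The function name and thresholds are made up for illustration; real junk slips past heuristics like these all the time:

```python
# Rough sketch of a heuristic validator for converted book text.
# The thresholds here are illustrative guesses, not tuned values.
import re

def looks_like_clean_text(text: str) -> bool:
    if not text.strip():
        return False
    # Obvious extraction junk: Unicode replacement chars, NUL bytes.
    junk = text.count("\ufffd") + text.count("\x00")
    if junk / len(text) > 0.001:
        return False
    # A low ratio of word-like tokens usually means tables, OCR noise,
    # or a garbled encoding slipped through.
    tokens = re.findall(r"\S+", text)
    if not tokens:
        return False
    wordlike = sum(1 for t in tokens if re.fullmatch(r"[A-Za-z'’\-]+", t))
    return wordlike / len(tokens) > 0.6
```

And even a check like this happily passes text where the footnotes got spliced into the middle of sentences, which is exactly the junk that is hardest to catch.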
2% garbage, if some of that garbage falls out the wrong way, is more than enough to seriously degrade search result quality.