
Comment by fake-name

4 days ago

> It's trivial to normalise the various formats,

Ha. Ha. ha ha ha.

As someone who has tried, pretty broadly, to normalize a pile of books and documents I have legitimate access to: no, it is not.

You can get good results 80% of the time, usable but messy results 18% of the time, and complete garbage the remaining 2%. More effort seems to only result in marginal improvements.
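For concreteness, the loop below is a minimal sketch of the kind of batch normalization I am describing, assuming Calibre's ebook-convert CLI is installed; the directory names and target format are made up. A clean exit only tells you the converter did not crash, not whether the output is good or merely usable-but-messy.

```python
# Hedged sketch: batch-convert a directory of ebooks to EPUB with
# Calibre's ebook-convert and tally outright failures.
# "library/raw" and "library/normalized" are hypothetical paths.
import subprocess
from pathlib import Path

SRC = Path("library/raw")         # hypothetical input directory
DST = Path("library/normalized")  # hypothetical output directory
DST.mkdir(parents=True, exist_ok=True)

ok, failed = 0, 0
for book in SRC.glob("*"):
    if not book.is_file():
        continue
    target = DST / (book.stem + ".epub")
    result = subprocess.run(
        ["ebook-convert", str(book), str(target)],
        capture_output=True, text=True,
    )
    if result.returncode == 0:
        ok += 1  # "converted" here does not mean "converted well"
    else:
        failed += 1
        print(f"conversion failed: {book.name}\n{result.stderr[-200:]}")

print(f"{ok} converted, {failed} failed outright")
```

In my experience the outright failures are the easy part; the hard 18% exits cleanly and only reveals itself when you actually read the output.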

98% sounds good enough for the use case suggested here.

  • Writing good validators for data is hard (see the toy sketch after this list). You can be 100% sure that there will be bad data in those 98%. From my own experience: I thought I had 50% of the books converted correctly, then found I still had junk data and gave up. It is not an impossible problem; I just was not motivated to fix it on my own. Working with your own copies is fine, but when you try to share the results you run into legal issues that I just do not find interesting enough to solve.

    Edit: my point is that I would like to share my work, but doing that legally is hard. That is the main reason I gave up.

  • 2% garbage, if some of it lands in the wrong places, is more than enough to seriously degrade search result quality; junk text tends to match a little of everything, so it surfaces exactly for the queries where real matches are scarce.
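To make the validator point concrete, here is a toy junk detector; the thresholds and heuristics are invented for illustration, not taken from any real pipeline.

```python
# Toy validator: flags the obvious kinds of junk (broken encodings,
# control characters, non-text residue). Thresholds are guesses.
# Structurally broken books (wrong chapter order, duplicated pages,
# OCR word soup) pass checks like these, which is why "validated"
# data can still contain garbage.
import re

def looks_like_junk(text: str) -> bool:
    if len(text) < 100:        # suspiciously short for a "book"
        return True
    if "\ufffd" in text:       # replacement char: mangled encoding
        return True
    # Control characters other than tab/newline/carriage return
    if re.search(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", text):
        return True
    # Mostly non-letter content suggests extraction residue
    letters = sum(c.isalpha() for c in text)
    if letters / len(text) < 0.5:
        return True
    return False
```

Every check like this buys a little confidence and costs some false positives, and none of them catch the semantically broken cases, which is roughly how I thought I was at 50% and turned out to be wrong.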