Comment by roywashere

1 year ago

I think it is very ironic that we chose PDF in many fields to archive data because it is a standard and because we would be able to open our PDF documents in 50 or 100 years' time. So here we are, just a couple of years later, already facing the challenge of getting the data out of our stupid PDF documents!

3 comments

esafak 1 year ago

It's not ironic. PDF is a container format, which can hold scanned documents as well as text. Scanned documents need OCR and layout analysis. This is not a failing of the PDF format, but a problem inherent to working with print scans.

I don't claim PDF is a good format. It is inscrutable to me.

scotty79 1 year ago
PDF is a horrible format. Even when it contains plain text, it has no concept of something as simple as a paragraph.
One can wonder how much of the wonkiness of LLMs comes from errors in extracting language from PDFs.
Adobe is the most harmful software development company in existence.
- phoh 1 year ago
  
  amen