Comment by esafak
17 days ago
It's not ironic. PDF is a container format that can hold scanned documents as well as text. Scanned documents need OCR and layout analysis. This is not a failing of the PDF format, but a problem inherent to working with print scans.
I don't claim PDF is a good format. It is inscrutable to me.
PDF is a horrible format. Even when it contains plain text, it has no concept of something as simple as a paragraph.
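To make the paragraph point concrete, here is a rough sketch of the kind of guesswork an extractor ends up doing. It assumes the text has already been pulled out as positioned fragments. The `Fragment` type, the `group_paragraphs` helper, and the `line_height` and `gap_factor` thresholds are all made up for illustration and do not come from any particular PDF library. It also assumes `y` is measured downward from the top of the page, which is the opposite of PDF's native coordinate system.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical positioned text run, as an extractor might report it."""
    text: str
    x: float  # left edge, in points
    y: float  # baseline, in points, measured downward from the page top (assumption)


def group_paragraphs(fragments, line_height=12.0, gap_factor=1.5):
    """Guess paragraphs from vertical gaps; thresholds are illustrative only."""
    # Read top to bottom, then left to right.
    frags = sorted(fragments, key=lambda f: (f.y, f.x))

    # Step 1: fragments with nearly the same baseline form one line.
    lines = []  # list of (baseline_y, [texts])
    for f in frags:
        if lines and abs(f.y - lines[-1][0]) < line_height * 0.5:
            lines[-1][1].append(f.text)
        else:
            lines.append((f.y, [f.text]))

    # Step 2: a vertical gap larger than gap_factor * line_height starts a new paragraph.
    paragraphs = []
    prev_y = None
    for y, texts in lines:
        line = " ".join(texts)
        if prev_y is not None and (y - prev_y) <= line_height * gap_factor:
            paragraphs[-1] += " " + line
        else:
            paragraphs.append(line)
        prev_y = y
    return paragraphs
```

Even this toy version breaks on multi-column layouts, tables, and mixed font sizes, which is roughly the complaint: the paragraph structure has to be inferred because the file does not carry it.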
One wonders how much of the wonkiness of LLMs comes from errors in extracting text from PDFs.
Adobe is the most harmful software development company in existence.
Amen.