Comment by rqtwteye

3 years ago

I still don't understand how we ended up with PDF as a sort of standard for archiving data. PDF is already pretty bad for things like manuals, but for things like spreadsheets we basically collect the data, then destroy all the structure by putting it into PDF, and later on painstakingly try to restore the data from the PDF, which is often almost impossible to do accurately.

It just shows that bad solutions often win.

I've thought about this and come round to thinking that the flaws of PDF are actually essential to the success of the document format.

- Non-responsive (compared to HTML). Allows PDFs to serve as a common standard between other document formats with different resizing logic, like LaTeX and Word.

- Difficulty of network access from code running inside the document. Allows PDFs to generally operate offline. Nobody's brave enough to try to write a single-page application in a PDF.

- Destroying data structure. Allows forward compatibility with anything that can be displayed statically on a screen. New applications can have different ideas about how tables, text or charts should work, but if there's static visual output then it'll convert to PDF. Awareness of, say, the structure of tables is precisely what makes it so difficult for, say, Google Sheets and Excel to stay compatible with each other's new table features. If somebody develops a new language with new characters not even in Unicode, it'll still work in a PDF.

It's also worth noting that most PDF limitations have the characteristic of making things hard but not absolutely impossible. These escape hatches prevent people with hard requirements from actually moving to a new format.

If it were truly impossible to get invoice data from PDFs people might've shifted to a different format for business transactions. But if it's merely difficult some company will come up with an API that works as a good enough extraction solution whose cost is justified by the other compatibility benefits of PDFs, so the ecosystem stays with PDFs.

  • > Difficulty of network access from code running inside the document. Allows PDFs to generally operate offline. Nobody's brave enough to try to write a single-page application in a PDF.

    You can absolutely do so. Most times however, the desire is to embed the latest cut of info into the PDF, then hand it off to somebody who will not have network access.

    t. Been there, done that. Had the end product thrown out because of Adobe's licensing terms. I also met one of the people responsible for the tooling I had to suffer through. I have their address, but they apologized, and explained the internal politics at the time; so I've chilled on the whole crushing their genitalia with a large wrench bit.

    Long story short: doable, but

      Do Not Follow. 
      This is not a place of honor. 
      No great deed was once commemorated here
      That which remains is repulsive to us, in our time, as it will be in yours. 

    Seriously. If I could fill this post with spikes and sick faces, I would.

      Vvvvvvvvvvvvvvvvvvvvvvvvvvvvv
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    

    XFA was the dream of madmen, and sadists, that decent men thought they could wrangle some positive utility out of. They were wrong.

    The trefoil is not an angel. The weird ring things are symbols for infectious waste.

  • Oh but there is:

    https://en.wikipedia.org/wiki/Apache_Flex

    Not sure if I linked to the right article, but it was basically compiled scripts/code that were embedded into PDFs and could run arbitrary code.

    "Apache Flex, formerly Adobe Flex, is a software development kit (SDK) for the development and deployment of cross-platform rich web applications based on the Adobe Flash platform."

For this particular case, the use of PDFs seems irrelevant. Photos were just taken of each polling unit’s results. These photos happened to then be embedded into PDFs for distribution, but the core underlying data is just an image embedded into that PDF. No important data was destroyed when these photos were placed into PDFs.

> how we ended up with PDF as sort of standard to archive data.

I don’t think we really did. They are a standard for archiving typeset page-based documents.

Of course, paper documents used to be standard for archiving data, and some continue to do so in the form of PDF.

In principle, it is possible to integrate all the structure you want in a PDF (using Marked Content, Structure Attributes and User Properties), but for data (as opposed to document structure) you’d need custom software to generate and interpret that.

Because PDF shows you a page on screen that will look the same if you print it out, and print layouts have been optimized for reading convenience over centuries. And if you give someone with no technical expertise a PDF file, it's virtually certain that they're going to be able to open it, because some kind of viewer is built into most operating systems.

You're totally right about PDF being a massive pain in the butt for any other purpose, but unless you have an alternative that handles the basic use case at least as well and other use cases way better, PDF is here to stay.

It's old, and sometimes things don't come out right, but this is one way out of that hornet's nest.

https://tabula.technology

There's also a CLI if that is more to your liking. If that doesn't do it, there's always the brute-force option of scripting in your language of choice to pull the data out.
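For the brute-force scripting route, the core trick is usually just grouping positioned text fragments back into rows. Here's a minimal sketch, assuming some PDF library has already handed you `(x, y, text)` tuples in PDF coordinates (y grows upward). The function name `rows_from_fragments` and the `y_tolerance` parameter are made up for illustration, not part of Tabula or any other tool.

```python
def rows_from_fragments(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into table rows.

    Fragments whose y is within y_tolerance of the first fragment of a row
    are treated as the same row; cells in a row are ordered left to right.
    """
    rows = []  # each entry: (anchor_y, [(x, text), ...])
    # PDF y grows upward, so sort by descending y to read top to bottom.
    for x, y, text in sorted(fragments, key=lambda f: (-f[1], f[0])):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    return [[text for _, text in sorted(cells)] for _, cells in rows]
```

Real documents need more than this (merged cells, wrapped lines, columns that drift), which is probably why it sometimes doesn't come out right.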

these are just photos embedded in a PDF, which actually isn't that bad an idea, because it lets you scan multiple pages and join them together as a 'document'

(not sure if the documents in OP had several pages, but if you've scanned/photographed a multi-page document, PDF is not that bad of a solution)

  • A better option would be to use the TIFF format. You can use it as a container format to store lossless and lossy image formats, and handles multiple images in a single container.

    It was the standard for scanners until PDF seemed to dominate the scene.

    • > It was the standard for scanners until PDF seemed to dominate the scene.

      Probably because it's much easier (for average users with few tools and skills) to print a PDF than to print any sort of non-page-based (e.g., image) file format and have the resulting sheet of paper match the scanned sheet of paper in terms of scale, orientation, position -- assuming both sheets are the same dimensions. Essentially using the file as an intermediary for physical copying of standard paper documents.

    • Except who knows if your application that supports TIFF files actually supports the features you want (multiple images, the compression format, etc)

  • I'm a teacher in the first year of the university. During the remote classes in the pandemic, we made it almost mandatory to upload the photos of the take-homes and questions using CamScanner [1].

    The student just downloads the app, and it fixes the orientation, rotation, bad lighting, contrast, and many other horrible things that a JPG may have, in particular the orientation and the ordering of multiple sheets. Also, Moodle has a little more support for PDF than JPG [2].

    I don't know how many three-letter agencies are reading the stream, but I'm happy that many three-letter-agency operatives now have better training in algebra and calculus.

    [1] https://www.camscanner.com/

    [2] It depends on how many optional packages your sysadmin installed.

Back in the day there were at least two programs competing for the role that PDF fills today that I remember: diskpaper and PDF. Apple also had one for its developer docs, but it was never released commercially, I believe.

PDF provided more fidelity for printing, had better tooling (it was by Adobe after all), it was cross-platform, could be displayed on the desktop, so it won. The reader was cross-platform so end-users didn't have to mess with installing plugins for various image types. And because everyone in the document creation division(1) used PostScript to print, printing to PDF was super-easy. And at some point everyone had a PostScript printer driver on their machine, so printing to PDF became super-easy as well.

It's not an archiving tool, but people use it for archiving...just like the way a spreadsheet isn't a project management tool, but millions of people use it for project management.

At this point the network effects for the PDF file format would make it difficult to replace. With PDF you can practically guarantee(2) that the file will look the same on any device.

(1) This was more true back then than today, probably.

(2) Assuming that you embedded the fonts, and that the reader doesn't suck.

What's funny is I don't think Adobe really makes any money off of PDF; it's an accidental de-facto standard.

  • > PDF provided more fidelity for printing, had better tooling

    This might have been true once, but using Acrobat now is so painful. Of all the apps that work, Apple's Preview is my editor of choice and when I’m on Windows I really miss it.

    • Well, before, nobody actually dealt with PDFs directly; they exported them out of FrameMaker or whatever tool they used to compose stuff (i.e., print-to-PDF).

      Acrobat has always been a really bad PDF editor. I'm not sure why that is, exactly, since their other editing tools were basically industry standard for a long time. All the interactive stuff like fillable forms, etc are probably incredibly hard to build.

It depends. There are PDFs with rasterized images of text (like in the article, when it’s a scan or photo of a document), then there are PDFs with vector positioned text runs (when it’s usually a result of some digital process). The latter are way easier to process than the former.
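
A rough way to tell the two kinds apart before picking a tool: look for text-showing operators (`Tj`/`TJ` inside a `BT` block) in the page content streams. This is a heuristic sketch using only the standard library, not a real PDF parser. The function name `classify_pdf` is made up, it only handles uncompressed and Flate-compressed streams, and it will miss text inside object streams or encrypted files. A scan with an OCR text layer will show up as "text".

```python
import re
import zlib

# Heuristic patterns, not a real PDF parser.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)
TEXT_OP_RE = re.compile(rb"[)>\]]\s*T[jJ]")  # e.g. "(Hello) Tj" or "[...] TJ"
IMAGE_RE = re.compile(rb"/Subtype\s*/Image")


def classify_pdf(data: bytes) -> str:
    """Guess whether raw PDF bytes hold text runs, only images, or neither."""
    for match in STREAM_RE.finditer(data):
        raw = match.group(1)
        candidates = [raw]
        try:
            # Most content streams are Flate-compressed; other filters are ignored.
            candidates.append(zlib.decompressobj().decompress(raw))
        except zlib.error:
            pass  # not zlib data, so only the raw bytes are checked
        for body in candidates:
            if b"BT" in body and TEXT_OP_RE.search(body):
                return "text"
    return "image-only" if IMAGE_RE.search(data) else "unknown"
```

Something like `classify_pdf(open(path, "rb").read())` would then tell you whether to reach for a text extractor or for OCR, at least as a first guess.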