Comment by whizzter

4 months ago

I'm debugging a PDF parsing issue as we speak and have actually started writing a parser (partly to understand the issue, partly as a last resort, since the code in the parser I was debugging felt a bit shoddy).

The PDF format is frankly quite horrible, extended over the years by kludges that feel like premature optimizations in some cases and bloated overkill in others.

While it's theoretically a nice idea, the issue is that there are just so many damn object types with specialized properties inside a PDF that you'd basically end up with all the complications of an FFI for each binding you'd write to expose a sane subset.

Theoretically one could perhaps make a canonical PDF<->JSON or similar mapping from an established library that most PDF data consumers/generators could use if memory usage isn't too constrained (because the underlying object model isn't entirely dissimilar).

You can do:

  cpdf -output-json in.pdf -o out.json

(Modify out.json as you like)

  cpdf -j out.json -o out.pdf

(Disclaimer, I wrote it.)
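
For the "modify out.json" step in between, a minimal Python sketch might look like the following. It assumes only that out.json is ordinary JSON (nested lists, dicts, strings, numbers) and deliberately does not rely on the specific layout cpdf emits or on how PDF strings are encoded in it. The helper names `walk`, `replace_text` and `edit_file` are made up for illustration.

```python
import json
from typing import Any, Callable


def walk(node: Any, visit: Callable[[Any], Any]) -> Any:
    """Return a copy of node with visit applied bottom-up to every value.

    Dict keys are left untouched; only values are passed to visit.
    """
    if isinstance(node, dict):
        node = {key: walk(value, visit) for key, value in node.items()}
    elif isinstance(node, list):
        node = [walk(item, visit) for item in node]
    return visit(node)


def replace_text(old: str, new: str) -> Callable[[Any], Any]:
    """Build a visitor that substitutes old with new in string values."""
    def visit(value: Any) -> Any:
        if isinstance(value, str):
            return value.replace(old, new)
        return value
    return visit


def edit_file(src: str, dst: str, visit: Callable[[Any], Any]) -> None:
    """Load JSON from src, transform it with visit, and write it to dst."""
    with open(src, encoding="utf-8") as fh:
        data = json.load(fh)
    with open(dst, "w", encoding="utf-8") as fh:
        json.dump(walk(data, visit), fh)
```

Something like `edit_file("out.json", "out.json", replace_text("Draft", "Final"))` would then sit between the two cpdf commands. Whether a blind string substitution is safe depends on what the strings actually represent in the exported file, so check a small document first.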

  • Seems cool for document usage. The online JS version, however, trashed the digital signatures with that rotate-10-degrees demo (not entirely sure if it was just a checksum issue, but it seemed to be worse, as in tinkering with or not round-tripping the signature data object).