Comment by Macha
5 hours ago
> The key insight is that bank statement PDFs are almost always columnar. Of course, this relies on the PDF having a proper text layer; if your bank sends you scanned images, you’re out of luck (though I’ve yet to encounter one that does). When you convert them to text while preserving the layout, you get something that looks like this:
So I decided to try this out with my bank, whose only export options are XLSX (in one of the slightly silly multi-line formats mentioned) or PDF, and it appears they've applied some "encryption" to the PDF to prevent this (really a simple substitution cipher, plus an embedded font with the characters jumbled up so it still renders correctly). All the marketing text and headers come through fine in the pdftotext output, but the actual data is all accented and non-printable characters (the same happens if you copy and paste from the PDF).
The substitution cipher does seem stable across a few statements, but it still seems like less work to use the XLSX.
I remember seeing an online shop that did the whole font substitution thing to prevent web scraping of their prices. I think they even changed the substitution between elements, so one couldn't just do a single-pass replacement and get the original data back.
I guess nowadays it's very cheap to run a headless browser, screenshot the output, and run it through OCR. Hah, to prevent that they'd have to design their webpage as one full-screen Captcha.
My bank outputs different data in the description field for CSV and PDF. The PDF statement descriptions are longer and contain more information.
Interesting! You might want to try Tabula in that case.
For the "obfuscated" PDFs of that type that I've come across, it does well; it's just a lot slower to run than pdf2text.
It appears Tabula also gets the substituted content instead.
What I'm seeing is that, for example, POS is replaced with & !ë on every line in every file. By comparing against the rendered PDF, I can see that other common text (my name, the local supermarket, etc.) also appears to be substituted 1:1.
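If the substitution really is 1:1 and stable across statements, undoing it could be as simple as building a lookup table from a few known garbled/rendered pairs. Here is a minimal Python sketch of that idea. The function names and the sample pairs are made up for illustration (they are not the bank's actual mapping), and it assumes every garbled character corresponds to exactly one rendered character:

```python
def build_mapping(pairs):
    """Build a garbled-to-rendered character map from known (garbled, rendered) pairs."""
    mapping = {}
    for garbled, rendered in pairs:
        if len(garbled) != len(rendered):
            raise ValueError(f"length mismatch: {garbled!r} vs {rendered!r}")
        for g, r in zip(garbled, rendered):
            # setdefault keeps the first mapping seen; a later disagreement
            # means the cipher is not a simple 1:1 substitution.
            if mapping.setdefault(g, r) != r:
                raise ValueError(f"{g!r} maps to both {mapping[g]!r} and {r!r}")
    return mapping


def decode(text, mapping, unknown="?"):
    """Apply the map; unmapped whitespace is kept, other unmapped characters become `unknown`."""
    return "".join(
        mapping.get(ch, ch if ch.isspace() else unknown) for ch in text
    )


# Hypothetical pairs, as if read off by comparing extracted text to the rendered PDF.
known = [("¤¶§", "POS"), ("§£¶¤", "SHOP")]
mapping = build_mapping(known)
print(decode("¤¶§ §£¶¤", mapping))
```

Any character that still comes out as `?` just means another known pair is needed to fill in the table. If whitespace turns out to be part of the cipher too, adding it to the known pairs takes priority over the pass-through.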
That's a ridiculously dumb idea on the bank's part.
Print the PDF to an image. Then use OCR. Then import the output from that instead.
> That's a ridiculously dumb idea on the bank's part.
Yes. The local banks here are pretty reliable for jumping on every dumb idea that anyone anywhere claims improves security.