Comment by lalitmaganti

2 months ago

Interesting! You might want to try Tabula in that case.

For the "obfuscated" PDFs of that type that I've come across, it does well; it's just a lot slower to run than pdf2text.

1 comment

lalitmaganti

It appears Tabula also extracts the substituted content rather than the real text.

What I'm seeing is that, for example, POS is replaced with & !ë on every line in every file. By comparing against the rendered PDF for other common text (my name, the local supermarket, etc.), I can see that those all seem to be 1:1 substitutions too.
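
If the substitutions really are 1:1, one way to undo them might be to build a lookup table from strings whose real text is known from the rendered PDF, then apply it to the rest of the extracted text. Here is a rough Python sketch of that idea. The function names `build_substitution_map` and `decode` are made up for illustration, and the sample pairs are hypothetical placeholders, not the actual substitutions from my files. It also assumes each extracted character corresponds to exactly one real character, which I haven't verified.

```python
def build_substitution_map(pairs):
    """Build an extracted-char -> real-char map from (extracted, rendered) pairs.

    Assumes a strict one-character-to-one-character substitution.
    """
    mapping = {}
    for extracted, rendered in pairs:
        if len(extracted) != len(rendered):
            # Lengths differ, so the pair cannot be aligned char by char;
            # skip it rather than guess at an alignment.
            continue
        for src, dst in zip(extracted, rendered):
            # setdefault stores dst the first time src is seen; a later
            # disagreement means the 1:1 assumption does not hold.
            if mapping.setdefault(src, dst) != dst:
                raise ValueError(
                    f"{src!r} maps to both {mapping[src]!r} and {dst!r}"
                )
    return mapping


def decode(text, mapping, unknown="?"):
    """Replace each extracted character; mark unmapped ones with `unknown`."""
    return "".join(mapping.get(ch, unknown) for ch in text)


# Hypothetical known pairs: (text as extracted, text as rendered).
pairs = [("xqz", "POS"), ("zqx", "SOP")]
mapping = build_substitution_map(pairs)
print(decode("qxz", mapping))
```

Raising on a conflicting pair is deliberate: if the same extracted character ever needs two different real characters, the mapping is not a simple substitution and this approach would silently produce wrong text otherwise.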