Comment by rstuart4133

2 hours ago

My own version of this is that I wanted bank transactions in CSV format. However, the transactions were more than a year old, and the bank only provides recent transactions in a downloadable form. They do, however, provide statements in PDF format going back indefinitely. But the objects in the PDF are arranged in a way that made pdftotext output near-indecipherable.

I thought I'd give Gemini a go. When I uploaded the 18-page PDF, it complained the output exceeded some limit. So I used pdftk to break it up into 4-page chunks, which seemed to work - the output looked very good and passed a couple of spot checks. But I don't trust these things as far as I can kick them.

There was a transaction column and a running balance column, so I did a quick check that every new balance equalled the previous one plus the transaction. And it almost always did. There were a couple of errors I initially put down to transcription slips. I was wrong. I eventually twigged that the errors only happened where I had split the PDF. After tracking down where the balance first went wrong, it became evident the model had dropped chunks of lines, duplicated others, and misaligned the transaction and balance columns. It was complete rubbish, in other words.
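The balance check described above is easy to script. A minimal sketch in Python; the two-column (transaction, balance) row layout and the CSV header names are assumptions for illustration, not any bank's actual format:

```python
import csv
from decimal import Decimal

def find_balance_breaks(rows):
    """Return indices of rows where balance != previous balance + transaction.

    `rows` is a list of (transaction, balance) Decimal pairs -- an assumed
    layout, not any particular bank's.
    """
    breaks = []
    for i in range(1, len(rows)):
        txn, balance = rows[i]
        prev_balance = rows[i - 1][1]
        if prev_balance + txn != balance:
            breaks.append(i)
    return breaks

def load_rows(path):
    """Read (transaction, balance) pairs from a CSV with those two columns."""
    with open(path, newline="") as f:
        return [(Decimal(r["transaction"]), Decimal(r["balance"]))
                for r in csv.DictReader(f)]
```

Decimal avoids the float rounding surprises you get with currency, and the first index returned points at the first place the transcription went off the rails, which in my case was usually a chunk boundary.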

So why did my balance check show so few errors? I put that down to the model knowing what a good bank statement looks like: a good bank statement adds up. So it adjusted the balances until the output looked like a real statement. I also noticed the errors got more frequent on later pages. I tried splitting the PDF into single pages and loading them into the model one at a time, but that didn't help much for the later pages; the first page was usually good. So then I loaded each page into a fresh context, with a fresh prompt. If that didn't produce something that balanced, a second go always did.
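The page-at-a-time retry loop can be sketched like this. `transcribe_page` stands in for whatever sends one page to the model in a fresh context and parses the reply into rows; it is entirely hypothetical, as is the (transaction, balance) row layout:

```python
from decimal import Decimal

def balances_ok(rows):
    """True if every balance equals the previous balance plus the transaction.

    Rows are (transaction, balance) Decimal pairs -- an assumed layout."""
    return all(rows[i - 1][1] + rows[i][0] == rows[i][1]
               for i in range(1, len(rows)))

def transcribe_until_balanced(page, transcribe_page, max_tries=3):
    """Call the (hypothetical) model once per attempt, each in a fresh
    context, until the transcription passes the balance check."""
    for _ in range(max_tries):
        rows = transcribe_page(page)
        if balances_ok(rows):
            return rows
    raise RuntimeError("page never produced a balancing transcription")
```

The cap on retries guards against a page that is genuinely unreadable; per the above, a second attempt always sufficed in practice.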

I'm not sure it saved time over doing it manually in the end. It's a tired analogy by now, but it's true: at heart, these things are stochastic parrots. They almost never produce the same output twice for the same input; instead, they produce output that has a high probability of following the input tokens supplied. If there is only one correct output and it is small enough, the odds are decent they will get it right. But as the size grows, the odds of it outputting complete crap become a near certainty.