Comment by simonw

9 months ago

Something I don't get is why OpenAI don't provide clear, comprehensive documentation as to how this actually works.

I get that there's competition from other providers now so they have an instinct to keep implementation details secret, but as someone building on their APIs this lack of documentation really holds me back.

To make good judgements about how to use this stuff I need to know how it works!

I had a hilarious bug a few weeks ago where I loaded in a single image representing multiple pages of a PDF and GPT-4 vision effectively hallucinated the contents of the document when asked to OCR it, presumably because the image was too big and was first resized to a point where the text was illegible: https://simonwillison.net/2024/Apr/17/ai-for-data-journalism...

If OpenAI had clear documentation about how their image handling works I could avoid those kinds of problems much more effectively.
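A quick way to see how badly a tall composite image would get shrunk is to compute the scale factor yourself. A sketch, assuming a 2048 px long-side limit (the exact limit is precisely the kind of thing that's undocumented):

```python
def downscale_factor(width: int, height: int, max_dim: int = 2048) -> float:
    """Factor each dimension would be scaled by to fit the longest side
    under max_dim (1.0 means no resize). max_dim is an assumption."""
    longest = max(width, height)
    return 1.0 if longest <= max_dim else max_dim / longest

# One US Letter page scanned at 200 dpi (1700 x 2200): barely shrunk.
print(downscale_factor(1700, 2200))
# Three pages stacked vertically (1700 x 6600): ~0.31x, text likely illegible.
print(downscale_factor(1700, 6600))
```

Once the factor drops much below ~0.5 on a dense scan, the text is probably no longer readable after resizing, which matches the hallucination behaviour described above.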

I was trying to figure out this exact same issue. OCR on a PDF worked great up until a certain point, when it just started hallucinating like crazy. I was building a whole pipeline to feed in a PDF one page at a time to get around the issue. Otherwise, the OCR is absolutely fantastic compared to all the other tools I've been trying lately, including OCRmyPDF (Tesseract), SuryaOCR, and some of the models on the Visual LLM Leaderboard.

I've also seen some people recommend PaddleOCR, but I find its documentation lacking and I haven't gotten it working yet, so I can't evaluate it.

But they do document that images are resized, and they give rough guidelines on how you should size them: low resolution is 1024 x 1024 with no tiling, and high resolution starts at 2048 x 2048 and gets tiled. It could use further documentation, but it's enough to know that you should never send more than one page per image via the API.

  • Right. But I still have a lot of questions. How does the model handle something important that spans multiple tiles in high-resolution mode? Am I better off doing the tiling myself with some overlap?
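One way to hedge against seam problems is to compute overlapping crop boxes yourself and send each tile as its own image. A sketch of the box arithmetic (the tile and overlap sizes are arbitrary assumptions, not OpenAI's actual parameters):

```python
def tile_boxes(width: int, height: int, tile: int = 1024, overlap: int = 128):
    """Return (left, top, right, bottom) crop boxes covering the image,
    with `overlap` px shared between neighbours so anything sitting on a
    seam appears whole in at least one tile."""
    step = tile - overlap
    boxes = []
    for top in range(0, max(height - overlap, 1), step):
        for left in range(0, max(width - overlap, 1), step):
            boxes.append((left, top,
                          min(left + tile, width),
                          min(top + tile, height)))
    return boxes

# A 2048 x 2048 page yields a 3 x 3 grid of overlapping tiles.
print(tile_boxes(2048, 2048))
```

Each box can then be passed to something like Pillow's `Image.crop` before upload; whether the model stitches your tiles together any better than its own is exactly the open question.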

The fact that it's so eager to hallucinate random things that sound plausible enough if you're not paying attention, without warning you or giving any error, should make people reconsider using it for "data journalism" or similar.

If you build your system and it "works", then how will you see the one time out of X where it confidently provides you false information that you happily use because it usually works?

  • > how will you see the one time out of X where it confidently provides you false information that you happily use because it usually works?

    You don’t. You treat it like you would a human worker: set up your process to detect or tolerate wrong output. If you can't, don't apply this tool to your work.

    • This is true but misses a key fact: typical LLM errors are different from human errors. Not that they're worse or better, just that you need to understand where and when LLMs are more likely to make mistakes, and how to manage that.

There is an effectively infinite space of things people could throw at it, and OpenAI can't know ahead of time whether your use case will work. Even if they told you exactly how it worked, you wouldn't know for sure until you tried it, and a vague explanation wouldn't help you either.

Is there documentation (is it possible?) on how to upload a PDF to gpt-4o using the API?

  • I think you have to split it into one image per page and then upload each page separately. That's how I've been doing it.
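After rendering each page to an image (e.g. with pdf2image or pypdfium2, not shown here), you can base64-encode each page into its own chat message. A sketch of the payload shape for the Chat Completions vision format, one page per request:

```python
import base64

def build_page_message(page_png: bytes, prompt: str = "OCR this page.") -> dict:
    """Wrap one rendered PDF page (PNG bytes) in a user message using a
    base64 data URL, following the chat-completions image_url shape."""
    b64 = base64.b64encode(page_png).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }
```

Each resulting message would then go in its own `client.chat.completions.create(...)` call; keeping pages separate sidesteps the resize problem discussed above.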