Comment by simonw

9 months ago

Something I don't get is why OpenAI don't provide clear, comprehensive documentation as to how this actually works.

I get that there's competition from other providers now so they have an instinct to keep implementation details secret, but as someone building on their APIs this lack of documentation really holds me back.

To make good judgements about how to use this stuff I need to know how it works!

I had a hilarious bug a few weeks ago where I loaded in a single image representing multiple pages of a PDF and GPT-4 vision effectively hallucinated the contents of the document when asked to OCR it, presumably because the image was too big and was first resized to a point where the text was illegible: https://simonwillison.net/2024/Apr/17/ai-for-data-journalism...

If OpenAI had clear documentation about how their image handling works I could avoid those kinds of problems much more effectively.
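A quick way to see how badly a tall composite image would get shrunk is to compute the scale factor yourself. A sketch, assuming a 2048 px long-side limit (the exact limit is precisely the kind of thing that's undocumented):

```python
def downscale_factor(width: int, height: int, max_dim: int = 2048) -> float:
    """Factor each dimension would be scaled by to fit the longest side
    under max_dim (1.0 means no resize). max_dim is an assumption."""
    longest = max(width, height)
    return 1.0 if longest <= max_dim else max_dim / longest

# One US Letter page scanned at 200 dpi (1700 x 2200): barely shrunk.
print(downscale_factor(1700, 2200))
# Three pages stacked vertically (1700 x 6600): ~0.31x, text likely illegible.
print(downscale_factor(1700, 6600))
```

Once the factor drops much below ~0.5 on a dense scan, the text is probably no longer readable after resizing, which matches the hallucination behaviour described above.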

I was trying to figure out this exact same issue. OCR on a PDF worked great up until a certain point, when it just started hallucinating like crazy. I was building a whole pipeline to feed in a PDF one page at a time to get around the issue. Otherwise, the OCR is absolutely fantastic compared to all the other tools I've been trying lately, including OCRmyPDF (Tesseract), SuryaOCR, and some of the models on the Visual LLM Leaderboard.

I've also seen some people recommend PaddleOCR, but I find its documentation lacking and I haven't gotten it working yet, so I can't evaluate it.

But they do document that images are resized, and they give rough guidelines on how you should size them: low resolution is 1024 x 1024 with no tiling, and high resolution starts at 2048 x 2048 and gets tiled. It could use further documentation, but it's enough to know that you should never send more than one page per image via the API.

  • Right. But I still have a lot of questions. How does the model handle something important that spans multiple tiles in high-resolution mode? Am I better off doing the tiling myself with some overlap?
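One way to hedge against seam problems is to compute overlapping crop boxes yourself and send each tile as its own image. A sketch of the box arithmetic (the tile and overlap sizes are arbitrary assumptions, not OpenAI's actual parameters):

```python
def tile_boxes(width: int, height: int, tile: int = 1024, overlap: int = 128):
    """Return (left, top, right, bottom) crop boxes covering the image,
    with `overlap` px shared between neighbours so anything sitting on a
    seam appears whole in at least one tile."""
    step = tile - overlap
    boxes = []
    for top in range(0, max(height - overlap, 1), step):
        for left in range(0, max(width - overlap, 1), step):
            boxes.append((left, top,
                          min(left + tile, width),
                          min(top + tile, height)))
    return boxes

# A 2048 x 2048 page yields a 3 x 3 grid of overlapping tiles.
print(tile_boxes(2048, 2048))
```

Each box can then be passed to something like Pillow's `Image.crop` before upload; whether the model stitches your tiles together any better than its own is exactly the open question.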

The fact that it's so eager to hallucinate random things that sound plausible enough if you're not paying attention, without warning you or giving any error, should make people reconsider using it for "data journalism" or similar.

If you build your system and it "works", then how will you see the one time out of X where it confidently provides you false information that you happily use because it usually works?

  • > how will you see the one time out of X where it confidently provides you false information that you happily use because it usually works?

    You don’t. You treat it like you would a human worker: set up your process to detect or tolerate wrong output. If you can't, don't apply this tool to your work.

    • This is true but misses a key fact: typical LLM errors are different from human errors. Not that they're worse or better, just that you need to understand where and when LLMs are more likely to make mistakes, and how to manage that.

There is an effectively infinite space of things people could throw at it, and OpenAI can't know ahead of time whether your use case will work. Even if they told you exactly how it worked, you wouldn't know for sure until you tried it, and a vague explanation wouldn't help you either.

Is there documentation (is it possible?) on how to upload a PDF to gpt-4o using the API?

  • I think you have to split it into one image per page and then upload each page separately. That's how I've been doing it.
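After rendering each page to an image (e.g. with pdf2image or pypdfium2, not shown here), you can base64-encode each page into its own chat message. A sketch of the payload shape for the Chat Completions vision format, one page per request:

```python
import base64

def build_page_message(page_png: bytes, prompt: str = "OCR this page.") -> dict:
    """Wrap one rendered PDF page (PNG bytes) in a user message using a
    base64 data URL, following the chat-completions image_url shape."""
    b64 = base64.b64encode(page_png).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }
```

Each resulting message would then go in its own `client.chat.completions.create(...)` call; keeping pages separate sidesteps the resize problem discussed above.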