Comment by anon373839

1 year ago

I’m definitely not getting that takeaway. This wasn’t even an OCR benchmark: the task was structured data extraction, and deterministic metrics were set aside in favor of GPT-as-a-judge.

VLMs are every bit as susceptible to the (unsolved) hallucination problem as regular LLMs are. I would not use them to do OCR on anything important because the failure modes are totally unbounded (unlike regular OCR).

> This wasn’t even an OCR benchmark: the task was structured data extraction, and deterministic metrics were set aside in favor of GPT-as-a-judge.

Looks like they've got deterministic metrics to me: For each document they've got a ground truth set of JSON extracted data, and they use json-diff to calculate the fields that disagree.

There is GPT-4o in their evaluation pipeline - but only as a means of converting the OCRed document into their target JSON schema.
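For illustration, a minimal Python sketch of that kind of field-level comparison — the benchmark itself uses the json-diff tool, so the function name, the recursion strategy, and the toy invoice data below are my own, not theirs:

```python
def diff_fields(expected, actual, path=""):
    """Recursively compare two JSON-like structures and return
    the list of field paths where they disagree."""
    mismatches = []
    if isinstance(expected, dict) and isinstance(actual, dict):
        for key in expected.keys() | actual.keys():
            sub = f"{path}.{key}" if path else key
            if key not in expected or key not in actual:
                mismatches.append(sub)  # field missing on one side
            else:
                mismatches += diff_fields(expected[key], actual[key], sub)
    elif isinstance(expected, list) and isinstance(actual, list):
        if len(expected) != len(actual):
            mismatches.append(path)  # differing list lengths count as one miss
        else:
            for i, (e, a) in enumerate(zip(expected, actual)):
                mismatches += diff_fields(e, a, f"{path}[{i}]")
    elif expected != actual:
        mismatches.append(path)  # leaf values disagree
    return mismatches

# Toy example: extraction got the total wrong on a 3-field document.
ground_truth = {"invoice_no": "A-123", "total": 45.00, "lines": [{"qty": 2}]}
extracted    = {"invoice_no": "A-123", "total": 46.00, "lines": [{"qty": 2}]}

wrong = diff_fields(ground_truth, extracted)   # ["total"]
accuracy = 1 - len(wrong) / 3                  # 3 leaf fields in ground truth
```

A metric like this is fully deterministic: the same extraction always scores the same, with no judge model in the loop.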

Also, what's strange is that no free or paid OCR engine was added to the mix for the evaluation. Tesseract is built specifically for OCR'ing scanned documents, and it has both traditional and neural-network-based modes, to boot.

  • > Also, what's strange is that no free or paid OCR engine was added to the mix for the evaluation.

    The article says they evaluated "Traditional OCR providers (Azure, AWS Textract, Google Document AI, etc.)"

    Are those not paid OCR engines?

    • You're absolutely correct. I read the article quite fast and assumed those were AI-based, albeit not LLM-powered, systems as well.

      I've been using computers since I learned to read, and when somebody says "traditional OCR", I think of older systems like Tesseract or ABBYY's FineReader, which can likewise be automated for batch processing, albeit mostly locally.

      Sending huge amounts of PDFs to a cloud server for processing still feels a bit alien to me, since from my perspective it can be done on-premises (or on a VPS with said software) very efficiently.