Comment by michaelt

1 year ago

> This wasn’t even an OCR benchmark: the task was structured data extraction, and deterministic metrics were set aside in favor of GPT-as-a-judge.

It looks to me like they do have deterministic metrics: for each document they have a ground-truth JSON of the extracted data, and they use json-diff to count the fields that disagree.
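To make "count the fields that disagree" concrete, here is a rough sketch of that kind of scoring. It is not the benchmark's actual code: I'm not using json-diff itself, just a hand-written flatten-and-compare, and the names `flatten` and `field_accuracy`, the sample documents, and the choice of denominator (the union of field paths on both sides) are all my own assumptions.

```python
from typing import Any, Dict


def flatten(value: Any, prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts/lists into {"a.b[0].c": leaf_value} pairs."""
    if isinstance(value, dict) and value:
        items: Dict[str, Any] = {}
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            items.update(flatten(child, path))
        return items
    if isinstance(value, list) and value:
        items = {}
        for index, child in enumerate(value):
            items.update(flatten(child, f"{prefix}[{index}]"))
        return items
    # Scalars (and empty containers) are treated as leaf fields.
    return {prefix: value}


def field_accuracy(truth: Any, predicted: Any) -> float:
    """Share of leaf fields that agree between ground truth and prediction.

    Assumption: a field counts as a mismatch if it is missing on either
    side or if the two values differ. The denominator is the union of
    field paths, which keeps the score in [0, 1].
    """
    truth_fields = flatten(truth)
    predicted_fields = flatten(predicted)
    all_paths = set(truth_fields) | set(predicted_fields)
    if not all_paths:
        return 1.0
    mismatched = sum(
        1
        for path in all_paths
        if path not in truth_fields
        or path not in predicted_fields
        or truth_fields[path] != predicted_fields[path]
    )
    return 1 - mismatched / len(all_paths)


# Hypothetical ground truth and extraction result for one document.
truth = {
    "invoice": {"number": "INV-1", "total": 42.5},
    "items": [{"sku": "A"}, {"sku": "B"}],
}
predicted = {
    "invoice": {"number": "INV-1", "total": 42.0},
    "items": [{"sku": "A"}],
}

print(field_accuracy(truth, predicted))  # 2 of 4 fields disagree -> 0.5
```

The point is that the same pair of JSON documents always gets the same score, with no model involved in the comparison itself.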

GPT-4o is in their evaluation pipeline, but only as a means of converting the OCRed document into their target JSON schema.