Comment by michaelt
1 year ago
> This wasn’t even an OCR benchmark: the task was structured data extraction, and deterministic metrics were set aside in favor of GPT-as-a-judge.
Looks like they've got deterministic metrics to me: for each document they've got ground-truth JSON for the extracted data, and they use json-diff to count the fields that disagree.
There is GPT-4o in their evaluation pipeline - but only as a means of converting the OCRed document into their target JSON schema.
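For anyone wondering what a metric like that boils down to, here's a rough sketch. To be clear, this is not their code and it doesn't use the json-diff library; it's a plain recursive comparison I'm assuming is close in spirit. The names `flatten` and `field_accuracy` and the sample documents are made up for illustration.

```python
def flatten(value, prefix=""):
    """Flatten nested dicts/lists into a {path: leaf_value} mapping.

    Note: empty dicts and empty lists contribute no fields in this sketch.
    """
    if isinstance(value, dict):
        fields = {}
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            fields.update(flatten(child, path))
        return fields
    if isinstance(value, list):
        fields = {}
        for index, child in enumerate(value):
            fields.update(flatten(child, f"{prefix}[{index}]"))
        return fields
    return {prefix: value}


def field_accuracy(truth, predicted):
    """Return (accuracy, mismatched_paths), scored against the ground-truth fields.

    A field counts as wrong if it is missing from the prediction or has a
    different value. Extra fields in the prediction are ignored here, which
    is an assumption of this sketch, not a claim about the benchmark.
    """
    truth_fields = flatten(truth)
    predicted_fields = flatten(predicted)
    if not truth_fields:
        return 1.0, []
    mismatched = sorted(
        path
        for path, expected in truth_fields.items()
        if path not in predicted_fields or predicted_fields[path] != expected
    )
    return 1 - len(mismatched) / len(truth_fields), mismatched


# Hypothetical ground truth and model output for one document.
truth = {
    "invoice": {"number": "A-17", "total": 120.5},
    "items": [{"sku": "x1"}, {"sku": "x2"}],
}
predicted = {
    "invoice": {"number": "A-17", "total": 125.0},
    "items": [{"sku": "x1"}],
}

accuracy, mismatched = field_accuracy(truth, predicted)
print(accuracy, mismatched)
```

The point is that nothing in the scoring step needs a model: same inputs, same score every time. Any nondeterminism would come from the step before it, where the OCR output gets turned into JSON.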