Comment by ritvikpandey21
15 days ago
appreciate the feedback. completely agree, tuning and/or training a VLM will definitely produce better ocr extractions. however, it's notoriously hard to accumulate a really good ground-truth labeled dataset of pdf/excel/pptx files. there are some resources online, especially for tables (IBM's labeled table dataset, for example), but we'd guess the same hallucination issues will persist on complex layouts
You can generate the data synthetically.
We never had the budget to do it, but I do have some notes somewhere on a 2D context-free grammar that generated arbitrarily nested rows/columns, plus CSS styling applied to the XHTML output of the grammar. It dynamically generated as much high-quality synthetic data as you wanted, but the IBM and similar datasets were plenty big enough for what we could do, even on specialist models.
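For what it's worth, the core idea is simple enough to sketch. Here's a minimal, hypothetical Python reconstruction (my names and probabilities, not the original notes): a recursive generator stands in for the 2D grammar, and a randomized stylesheet gets applied to the markup so every sample carries its own free ground-truth structure.

```python
import random

# Toy stand-in for a 2D context-free grammar over nested tables
# (a hypothetical sketch, not the original implementation).
# A cell is either a text leaf or a nested sub-table; `depth`
# bounds the nesting.
def gen_cell(depth: int) -> str:
    if depth == 0 or random.random() < 0.7:
        return f"<td>cell {random.randint(0, 999)}</td>"
    # Recurse: a whole table nested inside a cell.
    return f"<td>{gen_table(depth - 1)}</td>"

def gen_table(depth: int) -> str:
    rows = random.randint(1, 4)
    cols = random.randint(1, 4)
    body = "".join(
        "<tr>" + "".join(gen_cell(depth) for _ in range(cols)) + "</tr>"
        for _ in range(rows)
    )
    return f"<table>{body}</table>"

# Randomized CSS applied to the generated markup so renderings
# vary in borders, spacing, etc.
def gen_css() -> str:
    border = random.choice(["1px solid black", "none", "2px dashed gray"])
    pad = random.randint(2, 12)
    return (f"table {{ border-collapse: collapse; }} "
            f"td {{ border: {border}; padding: {pad}px; }}")

def gen_sample() -> str:
    return (f"<html><head><style>{gen_css()}</style></head>"
            f"<body>{gen_table(depth=2)}</body></html>")

if __name__ == "__main__":
    # The markup string doubles as the ground-truth structure label;
    # render it (e.g. with a headless browser) to get the image half
    # of each training pair. Labeling is free.
    print(gen_sample())
```

Render the string with a headless browser for the image side of each pair and keep the markup as the label, and you can scale the dataset as far as your rendering budget goes.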
It depends on what you're doing, really. I thought we'd done pretty well; then someone on HN reached out with a table that spanned 50 pages, and I just gave up.
Feel free to drop me an email if you'd like a quick chat. I find the state of table models particularly abysmal given how important they are.