Comment by ritvikpandey21
15 days ago
appreciate the feedback. completely agree, tuning and/or training a VLM will definitely produce better ocr extractions. however, it's notoriously hard to accumulate a really good ground-truth labeled dataset of pdf/excel/pptx files. there are some resources online, especially for tables (IBM's labeled table dataset, for example), but we'd guess the same hallucination issues will persist on complex layouts
You can generate the data synthetically.
We never had the budget to do it, but I do have some notes somewhere on a 2D context-free grammar that generated arbitrarily nested rows/columns, plus CSS styling applied to the XHTML output of the grammar. It dynamically generated as much high-quality synthetic data as you wanted, but the IBM and similar datasets were plenty big enough for what we could do, even on specialist models.
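For what it's worth, the core idea is simple enough to sketch. Here's a minimal, hypothetical Python reconstruction (my names and probabilities, not the original notes): a recursive generator stands in for the 2D grammar, and a randomized stylesheet gets applied to the markup so every sample carries its own free ground-truth structure.

```python
import random

# Toy stand-in for a 2D context-free grammar over nested tables
# (a hypothetical sketch, not the original implementation).
# A cell is either a text leaf or a nested sub-table; `depth`
# bounds the nesting.
def gen_cell(depth: int) -> str:
    if depth == 0 or random.random() < 0.7:
        return f"<td>cell {random.randint(0, 999)}</td>"
    # Recurse: a whole table nested inside a cell.
    return f"<td>{gen_table(depth - 1)}</td>"

def gen_table(depth: int) -> str:
    rows = random.randint(1, 4)
    cols = random.randint(1, 4)
    body = "".join(
        "<tr>" + "".join(gen_cell(depth) for _ in range(cols)) + "</tr>"
        for _ in range(rows)
    )
    return f"<table>{body}</table>"

# Randomized CSS applied to the generated markup so renderings
# vary in borders, spacing, etc.
def gen_css() -> str:
    border = random.choice(["1px solid black", "none", "2px dashed gray"])
    pad = random.randint(2, 12)
    return (f"table {{ border-collapse: collapse; }} "
            f"td {{ border: {border}; padding: {pad}px; }}")

def gen_sample() -> str:
    return (f"<html><head><style>{gen_css()}</style></head>"
            f"<body>{gen_table(depth=2)}</body></html>")

if __name__ == "__main__":
    # The markup string doubles as the ground-truth structure label;
    # render it (e.g. with a headless browser) to get the image half
    # of each training pair. Labeling is free.
    print(gen_sample())
```

Render the string with a headless browser for the image side of each pair and keep the markup as the label, and you can scale the dataset as far as your rendering budget goes.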
It depends on what you're doing, really. I thought we'd done pretty well; then someone on HN reached out with a table that spanned 50 pages, and I just gave up.
Feel free to drop me an email if you'd like a quick chat. I find the state of table models particularly abysmal given how important they are.