Comment by llm_trw

15 days ago

This is a response to: https://news.ycombinator.com/item?id=42952605

A fun thread to read for the current hype cycle.

You can tell who actually works in the field and who doesn't: practitioners don't use VLMs for OCR, while everyone else assumes OCR is a solved problem.

A question for the authors.

Do you have the resources to train any VLMs from scratch? They aren't quite at the level of the SOTA LLMs, and I think they could be made a lot more useful with:

1). Better training data.

2). Larger vision parts of the model.

In short: 2D attention is not something anyone is doing at scale, as far as I know, and it's a no-brainer for understanding images.
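
To make the "2D attention" point concrete, here is a minimal sketch (my own illustration, not anything from the thread) of single-head attention over an H x W grid of image patches with a 2D relative-position bias, so the attention scores see grid geometry rather than a flattened sequence. All names and shapes are illustrative assumptions; a real VLM would use learned biases and multiple heads.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_2d(patches, H, W, seed=0):
    """patches: (H*W, d) patch embeddings, laid out row-major on an HxW grid."""
    N, d = patches.shape
    assert N == H * W
    rng = np.random.default_rng(seed)
    # Random projections stand in for learned Q/K/V weights.
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = patches @ Wq, patches @ Wk, patches @ Wv
    scores = q @ k.T / np.sqrt(d)                       # (N, N)
    # 2D relative-position bias: one scalar per (dy, dx) offset,
    # the part a flat 1D sequence model doesn't get for free.
    bias_table = rng.standard_normal((2 * H - 1, 2 * W - 1)) * 0.1
    ys, xs = np.divmod(np.arange(N), W)                 # grid coordinates
    dy = ys[:, None] - ys[None, :] + H - 1              # shift to table index
    dx = xs[:, None] - xs[None, :] + W - 1
    scores += bias_table[dy, dx]
    return softmax(scores) @ v                          # (N, d)
```

The only 2D-specific part is the bias lookup: two patches at the same vertical/horizontal offset share a bias regardless of where they sit in the flattened sequence.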

Appreciate the feedback. Completely agree that tuning and/or training a VLM will definitely produce better OCR extractions. However, it's notoriously hard to accumulate a really good ground-truth labeled dataset of PDF/Excel/PPTX. There are some resources online, especially for tables, IBM's labeled table dataset for example. However, we'd guess the same hallucination issues will persist on complex layouts.

  • You can generate the data synthetically.

    We never had the budget to do it, but I do have notes somewhere on a 2D context-free grammar that generates arbitrarily nested rows/columns, plus CSS styling applied to the grammar's XHTML output. It could dynamically generate as much high-quality synthetic data as you wanted, but the IBM and similar datasets were plenty big for what we could do, even on specialist models.

    It depends on what you're doing really. I thought that we'd done pretty well, then someone on HN reached out with a table that spanned 50 pages and I just gave up.

    Feel free to drop an email if you'd like a quick chat. I find the state of table models particularly abysmal for how important they are.
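
The 2D-grammar idea above can be sketched roughly like this: a cell either terminates in text or expands into a nested grid of rows and columns, and rendering the result as an XHTML table gives you markup that is its own ground truth. This is my guess at the shape of the approach, not the commenter's actual grammar, and the production probabilities are arbitrary.

```python
import random

def gen_cell(rng, depth, max_depth=3):
    # Production: CELL -> text | GRID(rows x cols of CELL)
    if depth >= max_depth or rng.random() < 0.6:
        return f"cell-{rng.randrange(1000)}"          # terminal: leaf text
    rows, cols = rng.randint(1, 3), rng.randint(1, 3)
    body = "".join(
        "<tr>"
        + "".join(f"<td>{gen_cell(rng, depth + 1, max_depth)}</td>"
                  for _ in range(cols))
        + "</tr>"
        for _ in range(rows)
    )
    return f"<table>{body}</table>"                   # nonterminal: nested grid

def gen_table(seed=0, max_depth=3):
    """Generate one arbitrarily nested XHTML table from the given seed."""
    rng = random.Random(seed)
    rows, cols = rng.randint(2, 4), rng.randint(2, 4)
    body = "".join(
        "<tr>"
        + "".join(f"<td>{gen_cell(rng, 1, max_depth)}</td>"
                  for _ in range(cols))
        + "</tr>"
        for _ in range(rows)
    )
    return f"<table>{body}</table>"
```

Each seed yields a distinct table whose XHTML is the label; rendering it with randomized CSS (borders, padding, fonts) would give the paired images for training.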