Comment by levocardia

15 days ago

LLMs do not struggle with raw text at all: they never lose decimal places or drop digits when transcribing a table from it. So the problem is not the internal representation. I do this all the time and all major LLMs handle it eminently well.

The problem comes from the vision part. Either (a) the ViT architecture needs a rework, or (b) the vision models need more training on tasks of the "copy this" nature versus the "do this" nature.

on raw text, LLMs usually do not struggle. however, when you start processing low-fidelity images (receipt scans with stains, documents with marks all over them, bent corners, rotated pages), these transcription issues become extremely noticeable. to your point about table extraction, i disagree: we've seen many examples on complex nested tables where the model hallucinated digits, especially from documents with unusual aspect ratios.
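Hallucinated digits like those described above can be quantified rather than eyeballed. A minimal sketch, assuming you have ground-truth cell values for a sample of tables: compute a character-level edit distance between the true cells and the model's extracted cells. The function names (`levenshtein`, `cell_error_rate`) and the sample values are illustrative, not from any particular evaluation suite.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insert/delete/substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cell_error_rate(truth: list[str], extracted: list[str]) -> float:
    # Fraction of edited characters across all table cells, paired in order.
    edits = sum(levenshtein(t, e) for t, e in zip(truth, extracted))
    chars = sum(len(t) for t in truth) or 1
    return edits / chars

truth     = ["1,234.56", "789.01", "0.003"]
extracted = ["1,234.56", "789.10", "0.008"]  # transposed pair + hallucinated digit
print(round(cell_error_rate(truth, extracted), 3))  # → 0.158
```

A per-cell variant (flag any cell with nonzero distance) is often more useful for financial documents, where a single wrong digit invalidates the whole cell regardless of how close it is.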

fully agree on the last point: the vit architecture will need some work for this. microsoft's been doing some excellent research in this area lately