Comment by ritvikpandey21

16 days ago

on raw text, LLMs usually do not struggle. however, when you start processing low-fidelity images (receipt scans with stains, documents with marks all over them, bent corners/areas, rotated docs) these transcription issues become extremely noticeable. to your point about table extraction, i disagree: we’ve had many examples on complex nested tables where the model hallucinated digits, especially from documents with weird aspect ratios.

fully agree on the last point, the ViT architecture will need some work for this. microsoft’s been doing some excellent research on this lately