
Comment by fpgaminer

15 days ago

A lot of problems jump out at me with this article, particularly with the explanation of multi-modal LLMs. I'll say that I _do_ agree with the thrust of the article. Don't trust LLMs. But it probably should have argued the legitimate issues with VLM-based OCR, rather than trying to claim that VLMs are somehow fundamentally flawed.

> LLMs process images through high-dimensional embeddings, essentially creating abstract representations that prioritize semantic understanding over precise character recognition.

This isn't true. CLIP and its derivatives don't prioritize semantic understanding. They are trained contrastively, which (very roughly speaking) means they need to be able to differentiate similar images. If two images are just white with a few words, the only way to differentiate them is to include the text in the embedding.

Pretrained CLIP models do tend to be a bit lossy in this department, but not by as much as you would think considering they boil an entire image down to something on the order of 768 floats.

> Each step in this pipeline optimizes for semantic meaning while discarding precise visual information.

Again, that ... doesn't make any sense. It's a bit foolhardy to even say _what_ the models do, given that not even the most brilliant ML researchers know. But the broad _hypothesis_ is that the CLIP pipeline optimizes for pairing images with captions amongst a large number of possibilities. Which, again, requires the models to surface all kinds of information from the image, and oftentimes requires surfacing specific text from the image. How else would they differentiate powerpoint slides? Math problems in images? Etc.
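To make the "pair images with captions amongst many possibilities" point concrete, here's a minimal sketch of a CLIP-style contrastive objective in PyTorch. The embeddings are random stand-ins and the temperature is just a typical value, not any particular lab's training code.

```python
# Minimal sketch of a CLIP-style contrastive objective (not any lab's exact code).
# img_emb and txt_emb are hypothetical L2-normalized embeddings for a batch of
# image/caption pairs; row i of each tensor is assumed to describe the same pair.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # Cosine similarity of every image against every caption in the batch.
    logits = img_emb @ txt_emb.t() / temperature          # (B, B)
    targets = torch.arange(logits.size(0))                # matching pair is the diagonal
    # The model is rewarded only when image i is closer to caption i than to the
    # other B-1 captions -- so any detail that distinguishes the pairs (including
    # rendered text) is useful signal to keep in the embedding.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Example with random stand-in embeddings:
B, D = 8, 768
img_emb = F.normalize(torch.randn(B, D), dim=-1)
txt_emb = F.normalize(torch.randn(B, D), dim=-1)
print(clip_contrastive_loss(img_emb, txt_emb))
```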

> Fixed patch sizes may split individual characters

This doesn't matter. We know from empirical evidence. But even if it _did_, there's plenty of vision models that use overlapping patches.
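For illustration, overlapping patches are just strided patch extraction with a stride smaller than the patch size. A quick sketch with made-up sizes (not any specific model's config):

```python
# Sketch of patch extraction, contrasting non-overlapping and overlapping patches.
# Patch/stride sizes here are illustrative, not taken from any specific model.
import torch

image = torch.randn(1, 3, 224, 224)                    # dummy RGB image

non_overlap = torch.nn.Unfold(kernel_size=16, stride=16)(image)
overlap     = torch.nn.Unfold(kernel_size=16, stride=8)(image)

print(non_overlap.shape)   # (1, 3*16*16, 196)  -- 14x14 disjoint patches
print(overlap.shape)       # (1, 3*16*16, 729)  -- 27x27 patches; a glyph split by
                           # one patch boundary still appears whole in a neighbor
```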

> Position embeddings lose fine-grained spatial relationships

This isn't true. The model is fully aware of the position of pixels within patches, and the position embedding is merely to tell it the position of the patches themselves within the image. Therefore it can derive the absolute position of every pixel, if it needs to. In fact, we have proof they can and do.
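Toy arithmetic for why the absolute position stays recoverable: the position embedding identifies which patch a token came from, and a pixel's offset inside that patch is part of the patch's own content. Sizes below are illustrative assumptions.

```python
# Toy arithmetic: recover a pixel's absolute position from (a) the patch index,
# which the position embedding encodes, and (b) the pixel's offset within the
# patch, which is part of the patch's raw content. Sizes are illustrative.
PATCH = 16
IMAGE = 224
patches_per_row = IMAGE // PATCH          # 14

def pixel_position(patch_index: int, row_in_patch: int, col_in_patch: int):
    patch_row, patch_col = divmod(patch_index, patches_per_row)
    return patch_row * PATCH + row_in_patch, patch_col * PATCH + col_in_patch

print(pixel_position(patch_index=17, row_in_patch=5, col_in_patch=3))  # (21, 51)
```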

> losing the ability to have human-in-the-loop evaluations, confidence scores, and bounding box outputs.

You get confidence scores for free because the model is explicitly trained to provide cosine similarity scores.
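As a sketch of what "for free" looks like in practice, here's the standard Hugging Face CLIP example pattern; the checkpoint is the usual public one and the candidate labels are arbitrary examples, not a recommendation.

```python
# Sketch of similarity-based "confidence" scores from a plain CLIP model via
# the Hugging Face transformers API. Checkpoint and labels are just examples.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224), "white")          # stand-in for a document scan
labels = ["a scanned invoice", "a handwritten letter", "a photo of a cat"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image are scaled cosine similarities; softmax turns them into
# relative scores over the candidate labels.
scores = outputs.logits_per_image.softmax(dim=-1)
for label, score in zip(labels, scores[0].tolist()):
    print(f"{label}: {score:.3f}")
```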

OWLv2 is a CLIP based open vocabulary bounding box model (from Google, makers of Gemini). It's finetuned from a standard, pretrained CLIP model. Nothing really special about the vision architecture; just that it gets finetuned to output bounding boxes. And it beats the pants off YOLO while being open vocabulary to boot. So not only are CLIP-like models capable of outputting bounding boxes, but OWLv2 was trained with human-in-the-loop processes and outputs confidence scores.
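For reference, a hedged sketch of using OWLv2 through transformers; the checkpoint name and post-processing call follow the library's published OWL-ViT/OWLv2 examples, so treat the exact signatures as assumptions rather than gospel, and the text queries are arbitrary.

```python
# Hedged sketch: open-vocabulary detection with OWLv2 via Hugging Face transformers.
import torch
from PIL import Image
from transformers import Owlv2Processor, Owlv2ForObjectDetection

processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")

image = Image.open("document.png")                     # any image you want to query
queries = [["a signature", "a company logo"]]          # free-form text queries

inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Confidence scores and bounding boxes come straight out of the CLIP-derived model.
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs=outputs, target_sizes=target_sizes, threshold=0.2
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(queries[0][label.item()], round(score.item(), 3),
          [round(v, 1) for v in box.tolist()])
```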

Oh and there's Florence, which is a VLM trained on bounding boxes.

> Favor common words over exact transcription

Nothing about LLMs indicates that. In fact, pretrained LLMs favor exact transcription.

> "Correct" perceived errors in the source document

Which OCR systems need to do to be useful for many applications. I get the argument that LLMs are a blackbox in this regard, which is a legitimate criticism, but correcting mistakes is not fundamentally the issue. It's better to say that LLMs _blindly_ correct issues. Whereas, perhaps, one could say a traditional OCR system can report "this is my exact transcription, I corrected it to this" and have various knobs to tweak thresholds. But there's no reason VLMs can't do that too.
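To be clear about what "VLMs can do that too" might mean: here is a purely hypothetical sketch of an output contract where the model reports both the raw reading and its correction, with a user-controlled threshold. Nothing in it is an existing API; it's just one way the knob could look.

```python
# Hypothetical sketch of "report the raw reading and the correction" applied to
# a VLM. The output format, field names, and confidences are all made up.
RAW_VLM_OUTPUT = [
    {"raw": "Invioce #1042", "corrected": "Invoice #1042", "confidence": 0.93},
    {"raw": "Total: $1,2O0.00", "corrected": "Total: $1,200.00", "confidence": 0.88},
]

def apply_corrections(spans, min_confidence=0.9):
    """Accept a correction only when the model's self-reported confidence
    clears the user-chosen threshold; otherwise keep the raw transcription."""
    return [
        s["corrected"] if s["confidence"] >= min_confidence else s["raw"]
        for s in spans
    ]

print(apply_corrections(RAW_VLM_OUTPUT, min_confidence=0.9))
# ['Invoice #1042', 'Total: $1,2O0.00']
```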

> Merge or reorder information based on learned patterns

LLMs are perfectly capable of regurgitating data verbatim. That's perhaps the first thing they learn to do to get loss down. That's what all long context models are benchmarked against.

> Produce different outputs for the same input due to sampling

You can turn off sampling, and then they are deterministic. Or you can output the logits to the user, which gives you effectively confidence scores on its transcription.
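Concretely, with the standard transformers `generate` API that looks roughly like the sketch below. The model here is a small public text-only stand-in; the same flags apply to the language side of a VLM.

```python
# Sketch: deterministic decoding plus per-token "confidence" from the logits.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Transcribe the document exactly:", return_tensors="pt")
out = model.generate(
    **inputs,
    do_sample=False,               # greedy decoding: same input -> same output
    max_new_tokens=20,
    output_scores=True,
    return_dict_in_generate=True,
)

# Probability the model assigned to each token it actually emitted.
gen_tokens = out.sequences[0, inputs["input_ids"].shape[1]:]
for tok, step_scores in zip(gen_tokens, out.scores):
    prob = torch.softmax(step_scores[0], dim=-1)[tok].item()
    print(repr(tokenizer.decode(tok)), f"{prob:.3f}")
```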

And a well trained LLM for this task isn't really "probabilistic" in the sense that its outputs are completely different each time. If it's trained and prompted specifically to transcribe a document, that's what it's going to do. Any variations in output at that point are a result of real vagaries either in the document, vision, or the user request.

If a user wants consistency, they merely need to ask for it. Or the VLM needs to be trained better. In either case, these models are _capable_ of it.

It's most important to note here that, outside of pretrained LLMs, all LLMs that users interact with are Reinforcement trained. So while they were next token prediction trained during _pretraining_, they get trained to seek reward in production. That vastly trims the logits and focuses the model explicitly on performing tasks. Well trained, production LLMs only really put probability fields around tokens that are legitimately valid for the task at hand (bounded by the LLM's intelligence, of course).

> Unlike traditional OCR systems that fail obviously when uncertain, LLMs make educated guesses that appear plausible but may be entirely wrong. Consider the sequence "rn" versus "m". To a human reader scanning quickly, or an LLM processing image patches, these can appear nearly identical. The model, trained on vast amounts of natural language, will tend toward the statistically more common "m" when uncertain.

Again, LLMs don't just regurgitate the most "common" stuff. They are context specific. Besides, it's the vision module that would be making the differentiation here between rn and m. A vision module that is likely neither better nor worse than the vision modules traditional OCR systems are using. (Of course, the LLM may process the vision module's output and notice that perhaps it mis-transcribed "rn" vs "m" and "correct" it. But correct it based on _context_ not on some simplistic statistical model as suggested.)

> There’s a great paper from July 2024 (millennia ago in the world of AI) titled “Vision language models are blind” that emphasizes shockingly poor performance on visual tasks a 5 year old could do

Absolutely. I work in this field, and these vision models are not at the same level as their language counterparts. Due in large part to a lack of good data, good training processes, and good benchmarks. The Cambrian-1 paper is quite insightful here, as it studies the vision benchmarks themselves (https://arxiv.org/abs/2406.16860). The TLDR is that most of the vision benchmarks are actually just text benchmarks, and performance barely degrades when the model is blinded. I've found the same to be true of almost all publicly available training datasets for vision models, which is likely why these models don't learn good, robust visual understandings.

That doesn't really speak to the fundamental capabilities of the vision models. It speaks to the lack of training them well. So, if a model is explicitly trained to do OCR using lots of high quality ground truth data (which is easy to get and generate), then their performance can, and does, excel.

---

Now, all of that said, I also don't agree with the prior post this post is in response to. I work with VLMs a lot as part of my research, and I can assure you that they are nowhere near human level on OCR. They can exceed human performance in very specific tasks at the moment, but that's about it.

Are they better than other OCR offerings? As of this moment, I would tend to trust someone who does OCR for a living, so if Pulse says VLMs aren't as good as their solution, I would probably trust that over someone else saying VLMs work for their specific application. And VLMs _absolutely_ come with a myriad of caveats. They aren't as reliable as a more mechanical OCR system. Expect something like GPT4o to completely glitch 1 in every 10,000 queries. And expect them to be "weird". GPT4o will tend to not fully follow instructions maybe 1 in 100 times, so you might get your document back in the wrong format, or have "Sure, I can help with that!" at the start of your document, etc. Gemini tends to have better instruction following, but I don't have a good assessment of its reliability yet.

If I, personally, had a small project that needed OCR, I'd use Tesseract if it's just PDFs or something like that with printed text. If it's something with weird fonts, fancy stuff, handwriting, math formulas, etc. I might give Gemini a try. If it's mission critical, pay an expert to do it, whether that's in-house or paying a service explicitly built for the purpose.
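For the Tesseract route, a quick pytesseract sketch (it requires the tesseract binary installed separately; the filename is just a placeholder):

```python
# Sketch of the "just use Tesseract for printed text" route via pytesseract.
from PIL import Image
import pytesseract

image = Image.open("scanned_page.png")

# Plain transcription.
text = pytesseract.image_to_string(image)

# Word-level boxes and confidence scores -- the outputs the article says you
# lose with VLMs come built in here.
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
for word, conf, x, y, w, h in zip(
    data["text"], data["conf"], data["left"], data["top"], data["width"], data["height"]
):
    if word.strip():
        print(word, conf, (x, y, w, h))
```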

---

NOTE: One thing that got glossed over in the article is that VLMs are not trained on the "embeddings" of the vision model, per se. CLIP processes the image as N tokens across L layers. At the end, you have N embeddings. For traditional CLIP, the last (or first) embedding is used as the result. Modern CLIPs average the embeddings together. Tomato, tomato.

VLMs are not trained on that single embedding from CLIP. The "head" gets stripped off, and the VLMs get trained on all N processed tokens from CLIP. So they have access to much more information. The vision models also get finetuned during the training of the VLM, and, importantly, CLIP architectures use skip connections throughout. So there is a direct path for the LLM to access pretty much anything from the vision model that it needs, and optimize for any information it needs.
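A rough sketch of that hookup, in the style of LLaVA-like open VLMs: the vision tower's per-token hidden states (all N of them, not a single pooled embedding) are projected into the LLM's embedding space and prepended to the text tokens. All dimensions below are illustrative assumptions.

```python
# Rough sketch of a LLaVA-style connector: per-token vision features are
# projected into the LLM's embedding space. Dimensions are illustrative.
import torch
import torch.nn as nn

N_VISION_TOKENS = 576        # e.g. a 384x384 image with 16x16 patches
D_VISION = 1024              # vision tower hidden size (assumed)
D_LLM = 4096                 # LLM hidden size (assumed)

vision_hidden_states = torch.randn(1, N_VISION_TOKENS, D_VISION)  # stand-in for CLIP output

# Simple two-layer MLP connector, as used by several open VLMs.
projector = nn.Sequential(
    nn.Linear(D_VISION, D_LLM),
    nn.GELU(),
    nn.Linear(D_LLM, D_LLM),
)

image_tokens = projector(vision_hidden_states)          # (1, 576, 4096)
text_tokens = torch.randn(1, 32, D_LLM)                 # stand-in for embedded prompt tokens

# The LLM attends over the concatenated sequence, so it can read out whatever
# the vision tower surfaced for any individual patch.
llm_input = torch.cat([image_tokens, text_tokens], dim=1)
print(llm_input.shape)                                  # torch.Size([1, 608, 4096])
```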

The size of the embedded information given to the LLM, then, is roughly the same as the number of pixels in the source image. For example, it might be something like a 384x384x3 image (442,368 dimensions) getting baked down into something like a 150,000-dimensional vector. So it's really not a fundamentally lossy process at that point.
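Back-of-the-envelope arithmetic for that claim, with assumed patch sizes and hidden widths (plug in whichever model you care about):

```python
# Back-of-the-envelope arithmetic for the paragraph above. Patch size and
# hidden width are assumptions, not the config of any particular model.
def vision_token_floats(image_side: int, patch: int, hidden_dim: int) -> int:
    n_tokens = (image_side // patch) ** 2
    return n_tokens * hidden_dim

pixels = 384 * 384 * 3
print(pixels)                                   # 442368 raw pixel values

# 384px image, 16px patches, 768-dim tokens: 576 * 768 floats.
print(vision_token_floats(384, 16, 768))        # 442368 -- same order as the pixels

# 224px image, 16px patches, 768-dim tokens: 196 * 768 floats.
print(vision_token_floats(224, 16, 768))        # 150528 -- the ~150k ballpark
```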