Comment by michaelbuckbee
15 days ago
I took a picture of a grocery list and pasted it into ChatGPT to have it written out, and it worked flawlessly... until I discovered that I'd messed up the picture by taking it at an angle and accidentally cutting off the first character or two of the items in the bottom half of the list.
ChatGPT just inferred that I wanted the actual full names of the items (aka "flour" instead of "our").
Depending on how you feel about it, this is either an absolute failure of OCR or wildly useful and much better.
This is great until it hallucinates rows in a report with company assets that don't exist - why wouldn't a mining company own some excavation equipment? - and pollutes all future searches with fake data right from the start.
I laugh every time I hear someone tell me how great VLMs are for serious work by themselves. They are amazing tools with a ridiculously fluctuating (and largely undetectable) error rate that need a lot of other tools to keep them above board.
It does seem that companies are able to get reliability in narrow problem domains via prompts, evals and fine tuning.
> It does seem
And therein lies the whole problem. The verification required for serious work is likely orders of magnitude more than anybody is willing to pay for.
For example, professional OCR companies have large teams of reviewers who double- or triple-review everything, and that is after the software itself flags recognition results with varying degrees of certainty. In virtually all larger-scale use cases, I don't think companies are thinking of LLMs as tools that require that level of dedication and resources.
In some cases this is true, but then why choose an expensive world model over a small net or random forest you trained specifically for the task at hand?
We completely agree - mechanistic interpretability might help keep these language models in check, but it’s going to be very difficult to run this on closed-source frontier models. I’m excited to see where that field progresses.
You can literally ask it to mark characters that are not clear instead of inferring them.
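For example, an added instruction along these lines (wording is illustrative, not a tested recipe):

    Transcribe the list exactly as written. If a character is cut off or
    illegible, write [?] in its place instead of guessing, and do not
    complete partial words.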
Unit tests work remarkably well
> They are amazing tools with a ridiculously fluctuating (and largely undetectable) error rate that need a lot of other tools to keep them above board.
So are human beings. Meaning we've been working around this issue since forever; we're not suddenly caught up in a new thing here.
While humans can and do make mistakes, it seems to me that there is a larger problem here: LLMs make mistakes for different reasons than humans do, and those reasons make them much worse than humans at certain types of problems (e.g. OCR). Worse, this weakness might be fundamental to LLM design rather than something that can be fixed by just LLM-ing harder.
I think a lot of this gets lost in the discussion because people insist on using terminology that anthropomorphizes LLMs to make their mistakes sound human. So LLMs are "hallucinating" rather than having faulty output because their lossy, probabilistic model fundamentally doesn't actually "understand" what's being asked of it the way a human would.
Human beings also have the ability to doubt their own abilities and understanding. In the case of transcription, if someone has doubts about what they are transcribing, they'll seek clarity instead of just making something up. They also have the capacity to know when something is too important to screw up and to adjust their behaviour appropriately.
I'm in the "failure" camp, because the true correctness of an answer comes from how it was reached. [0]
The correct (or at least humanly-expected) process would be to identify the presence of a mangled word, determine what its missing leading characters could have been, and, if some candidate is a clear contextual winner (e.g. "fried chicken" not "dried chicken"), use that.
However I wouldn't be surprised if the LLM is doing something like "The OCR data is X. Repeat to me what the OCR data is." That same process could also corrupt things, because it's a license to rewrite anything to look more like its training data.
[0] If that's not true, then it means I must have a supernatural ability to see into the future and correctly determine the result of a coin toss in advance. Sure, the power only works 50% of the time, but you should still worship me for being a major leap in human development. :p
> I'm in the "failure" camp, because the true correctness of an answer comes from how it was reached.
Something I may have believed until I got married. Now I know that "fnu cwken" obviously means "fresh broccoli, because what else could it mean, did I say something about buying chicken, obviously this is not chicken since I asked you to go to produce store and they DON'T SELL CHICKEN THERE".
Seriously though, I'm mostly on the side of "huge success" here, but LLMs sometimes really get overzealous with fixing what ain't broke.
I often think that LLM issues like this could be solved by a final pass of "is the information in this image the same as this text" (i.e. generally a verification pass).
It might be that you would want to use a different model (non-generative) for that last pass -- which is like the 'array of experts' type approach. Or comparing to your human analogy, like reading back the list to your partner before you leave for the shops.
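A second LLM call is the simplest version of that idea; here's a rough sketch using the OpenAI Python client (the model name, prompt wording, and function are placeholders, not a tested pipeline):

    # Verification pass: a separate call that checks the extracted text
    # against the original image instead of trusting the first transcription.
    import base64
    from openai import OpenAI

    client = OpenAI()

    def verify_transcription(image_path: str, extracted_text: str) -> str:
        with open(image_path, "rb") as f:
            image_b64 = base64.b64encode(f.read()).decode()

        response = client.chat.completions.create(
            model="gpt-4o",  # placeholder for any vision-capable model
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Does the text below match the image exactly? "
                             "List every item that differs, is missing, or "
                             "does not appear in the image; otherwise reply "
                             "MATCH.\n\n" + extracted_text},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                ],
            }],
        )
        return response.choices[0].message.content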
Isn’t that just following through on “from how it was reached”? Without any of that additional information, if the LLM gave the same result, we should consider it the product of hallucination
On your epistemology, if you correctly guess the outcome of a random event then the statement, even if contingent on an event that did not yet occur, is still true. The same goes for every incorrect guess.
If you claim that you guess correctly 50% of the time then you are, from a Bayesian perspective, starting with a reasonable prior.
You then conflate the usefulness of some guessing skill with logic and statistics.
How this relates to an LLM is that the priors are baked into the LLM so statistics is all that is required to make an educated guess about the contents of a poorly written grocery list. The truthfulness of this guess is contingent on events outside of the scope of the LLM.
How often that happens - applying a scalar value to the statistical outcome of an event - is very important. If your claim is that LLMs are wrong 50% of the time then you need to update your priors based on some actual experience.
To consider: do we overestimate what we know about how we humans reach an answer? (Humans are very capable of intuitively reading scrambled text, for example, as long as the beginning and ending of each word remains correct.)
The correct way to handle it is to ask the user if it's not clear, like a real assistant would
I once did something similar with a recipe from a cookbook where the recipe started at the bottom of one page and continued onto the next page. It correctly identified the first few ingredients present in the photo of the first page but then proceeded to hallucinate another half-dozen or so ingredients in order to generate a complete recipe.
Yup, this is a pretty common occurrence when using LLMs for data extraction. For personal use (loading a receipt) it’s great that the LLM filled in info. For production systems which need high-quality, near-100% extraction accuracy, inferring results is a failure. Think medical record parsing, financial data, etc. These hallucinations occur quite frequently, and we haven’t found a way to minimize them through prompt engineering.
It's not possible with current gen models.
To even have a chance at doing it you'd need to start the training from scratch with _huge_ penalties for filling in missing information and a _much_ larger vision component to the model.
See an old post I made on what you need to get above sota OCR that works today: https://news.ycombinator.com/item?id=42952605#42955414
Maybe ask it to return the bounding box of every glyph.
This universally fails, on anything from frontier models to Gemini 2.0 Flash in its custom fine-tuned bounding box extraction mode.
My experiences have been the same… that is to say nothing like what is reported here. This is more pitch than info.
Odd timing, too, given the Flash 2.0 release and its performance on this problem.
Failed as a machine, succeeded as an intelligence. Intelligences don't make good machines though.
I recently took a picture of the ingredients list on my multivitamin and fed it into ChatGPT o1-pro (at least that was the config), and it made up ingredients and messed up the quantities.
Sounds like a Xerox.