Comment by simonw
1 year ago
The benchmark I most want to see around OCR is one that covers risks from accidental (or deliberate) prompt injection - I want to know how likely it is that a model might OCR a page and then accidentally act on instructions in that content rather than straight transcribing it as text.
I'm interested in the same thing for audio transcription too, for models like Gemini or GPT-4o that accept audio input.
We've tested basic prompt injections within images, but haven't been able to reliably trigger any adverse effects.
However there are two big bugs we've found with VLMs:
1. Correcting the document. If you have an income statement where the line items add up to $1,001 but the printed total says $1,000, the model will frequently "fix" the total in its output. Which would be terrible if you were trying to build an "identify mistakes in these documents" type tool.
2. Infinite loops. Sometimes the models get hung up on a particular token and repeat it until the request times out. This gets triggered a lot in markdown tables: |---|---|----------------->
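Bug 1 above can be caught downstream by re-doing the arithmetic yourself instead of trusting the model's transcription of the total. A minimal sketch, assuming you've already parsed the OCR output into numbers (the function name and amounts are made up for illustration):

```python
# Hypothetical check: recompute the sum of OCR'd line items and compare it
# against the total the model transcribed, so a silently "corrected" total
# shows up as a mismatch with the source document.

def check_total(line_items, stated_total):
    """Return (computed_total, matches) for a list of OCR'd amounts."""
    computed = round(sum(line_items), 2)
    return computed, computed == stated_total

# Mirroring the comment: items sum to $1,001 but the page says $1,000.
computed, ok = check_total([500.0, 300.0, 201.0], 1000.0)
print(computed, ok)  # 1001.0 False
```

If the model has "helpfully" rewritten the total to $1,001, this check passes against the model's output but fails against the original document, which is exactly the discrepancy a mistake-finding tool needs surfaced rather than hidden.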
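Bug 2 can at least be detected on the client side by watching the streamed output for a short chunk repeating over and over, rather than waiting for the timeout. A rough sketch (thresholds are arbitrary, not tuned):

```python
# Hypothetical guard against token-repetition loops: flag the output when
# its tail is just the same short chunk repeated many times in a row.

def looks_stuck(text, chunk_len=4, min_repeats=20):
    """True if the last chunk_len chars repeat min_repeats times in a row."""
    tail = text[-chunk_len * min_repeats:]
    if len(tail) < chunk_len * min_repeats:
        return False
    chunk = tail[:chunk_len]
    return tail == chunk * min_repeats

print(looks_stuck("|---" * 50))             # True: runaway table divider
print(looks_stuck("| a | b |\n|---|---|"))  # False: normal markdown table
```

In practice you'd call this on the accumulated text every few tokens and abort (or retry with different sampling settings) when it fires.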
That's an interesting OCR aspect indeed; it's great that their OCR benchmark is open source, so a category like that could be added. Or maybe separate OCR prompt-injection benchmarks already exist.
Also, it'd be useful to understand how an OCR context differs from standard injection attacks. One thing I can think of is tabular injection attacks, but image-based attacks, especially for VLMs, are also relevant. So an OCR injection-attack benchmark might just be a combination of different domain-specific benchmarks formatted as images.
Given how the technology works, this risk can never be fully eliminated without breaking fundamental parts of what makes it useful: it's the classic in-band vs. out-of-band signalling problem. So far out-of-band analysis hasn't helped either.
I recall reading somewhere that traditionally Hebrew religious scrolls (copied by hand) would be checked against the original by young children who knew the letters but couldn't really read yet. In that vein, I wonder if we could have a VLM intentionally built to not understand the actual words.
Playing with local llama vision and minicpm-v models, they do seem resistant to what one might call blatant prompt injection, i.e. just inserting one of the classic "ignore previous instructions" strings or similar.
So yeah, would be curious how susceptible they are to more refined approaches. Are there some known examples?