Comment by Martin_Silenus

2 days ago

Wait… that's the specific question I had, because rendered text would require OCR to be read by a machine. Why would an AI do that costly process in the first place? Is it part of the multi-modal system without it being able to differentiate that text from the prompt?

If the answer is yes, then that flaw does not make sense at all. It's hard to believe they can't prevent this. And even if they can't, they should at least improve the pipeline so that any OCR feature does not automatically inject its result into the prompt, and instead tells the user about it and asks for confirmation.

Damn… I hate these pseudo-neurological, non-deterministic piles of crap! Seriously, let's get back to algorithms and sound technologies.

The AI is not running an external OCR process to understand text any more than it is running an external object classifier to figure out what it is looking at: it, inherently, is both of those things to some fuzzy approximation (similar to how you or I are as well).

  • That I can get, but anything that's not part of the prompt SHOULD NOT become part of the prompt; it's that simple to me. Definitely not without triggering something.

    • _Everything_ is part of the prompt - an LLM's perception of the universe is its prompt. Any distinctions a system might try to draw beyond that are either probabilistic (e.g., a bunch of RLHF to not comply with "ignore all previous instructions") or external to the LLM (e.g., send a canned reply if the input contains "Tiananmen").

    • There's no distinction in the token-predicting systems between "instructions" and "information", no code-data separation. (A concrete sketch of that flattening follows this thread.)

    • i'm sure you know this, but it's important not to understate the importance of the fact that there is no "prompt"

      the notion of "turns" is a useful fiction on top of what remains, under all of the multimodality and chat uis and instruction tuning, a system for autocompleting tokens in a straight line

      the abstraction will leak as long as the architecture of the thing makes it merely unlikely rather than impossible for it to leak

    • From what I gather these systems have no control plane at all. The prompt is just added to the context. There is no other program (except maybe an output filter).

      3 replies →

    • It's that simple to everyone--but how? We don't know how to accomplish this. If you can figure it out, you can become very famous very quickly.
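To make that flattening concrete, here is a minimal sketch assuming the HuggingFace transformers library and a checkpoint that ships a chat template (the model name is only an example): the "system" and "user" turns, plus the assistant reply prefix, come back as one linear token sequence, and the turn boundaries are just more tokens in that stream.

```python
# Minimal sketch: "turns" collapse into one flat token sequence.
# Assumes the transformers library and a checkpoint that ships a chat
# template (the model name below is only an example).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

messages = [
    {"role": "system", "content": "You are a careful assistant."},
    {"role": "user", "content": "Summarize the attached screenshot."},
]

# The chat template just interleaves special tokens with the text;
# what the model actually receives is this single linear sequence.
flat = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(flat)
```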

> Wait… that's the specific question I had, because rendered text would require OCR to be read by a machine. Why would an AI do that costly process in the first place? Is it part of the multi-modal system without it being able to differentiate that text from the prompt?

It's part of the multimodal system: the image itself is part of the prompt (other than tuning parameters that control how it does inference, there is no input channel to a model except the prompt). There is no separate OCR feature.

(Also, the prompt is just the initial, fixed part of the context, not something meaningfully separate from the output. All the structure in the context (prompt vs. output, the deeper structure within either for tool calls, media, etc.) describes how the toolchain populated it or treats it; fundamentally it isn't part of how the model itself operates.)
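A toy sketch of that point; everything here is invented for illustration, with `next_token` standing in for a real model's forward pass and sampling step. The "prompt" is only the initial slice of one growing list, and the "output" is the tail of the same list.

```python
# Toy illustration, not a real model: the "prompt" is only the initial
# slice of one growing context; "output" is whatever gets appended.
PROMPT = ["<user>", "What", "does", "the", "image", "say?", "<assistant>"]
CANNED = ["It", "says:", "ignore", "previous", "instructions", ".", "<eos>"]

def next_token(context: list[str]) -> str:
    """Stand-in for a model forward pass. A real LLM conditions on *all*
    of `context`; nothing marks which tokens were prompt vs. output."""
    return CANNED[len(context) - len(PROMPT)]

context = list(PROMPT)                    # the "prompt" is just the start
while context[-1] != "<eos>":
    context.append(next_token(context))   # output lands in the same list

print(" ".join(context[len(PROMPT):]))    # "output" = a view over the tail
```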

I mean, even back in 2021 the CLIP model was getting fooled by text overlaid onto images: https://www.theguardian.com/technology/2021/mar/08/typograph...

That article shows the classic example of an apple being classified as a Granny Smith with 85% confidence; taping a handwritten label saying "iPod" in front of it makes it get classified as an iPod with 99.7% confidence.
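For context, CLIP-style zero-shot classification looks roughly like this; a minimal sketch using the openai/clip-vit-base-patch32 checkpoint via HuggingFace transformers (the image path and label prompts are placeholders):

```python
# Sketch: CLIP zero-shot classification of one image against two labels,
# using the openai/clip-vit-base-patch32 checkpoint via transformers.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("apple_with_ipod_label.jpg")   # placeholder path
labels = ["a photo of a Granny Smith apple", "a photo of an iPod"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]

for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.1%}")
```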

  • The handwritten label was by far the dominant aspect of the "iPod" image. The only mildly interesting aspect of that attack is a reminder that tokenizing systems are bad at distinguishing a thing (iPod) from a reference to that thing (the text "iPod").

    The apple has nothing to do with that, and it's bizarre that the researchers failed to understand it.

Smart image encoders (multimodal models) can read the text in an image.

Think gpt-image-1, where you can draw arrows on the image and type text instructions directly onto the image.
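Roughly what that looks like through the OpenAI images API; a sketch that assumes the openai Python SDK, an API key in the environment, and a locally saved annotated screenshot (the path and prompt text are placeholders):

```python
# Sketch: asking gpt-image-1 to act on instructions drawn onto the image.
# Assumes the openai Python SDK and OPENAI_API_KEY in the environment;
# the file path and prompt are placeholders.
import base64
from openai import OpenAI

client = OpenAI()

result = client.images.edit(
    model="gpt-image-1",
    image=open("annotated_screenshot.png", "rb"),
    prompt="Follow the arrows and handwritten notes on the image.",
)

with open("edited.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```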

  • I did not ask about what AI can do.

    • > Is it part of the multi-modal system without it being able to differentiate that text from the prompt?

      Yes.

      The point the parent is making is that if your model is trained to understand the content of an image, then that's what it does.

      > And even if they can't, they should at least improve the pipeline so that any OCR feature does not automatically inject its result into the prompt, and instead tells the user about it and asks for confirmation.

      That's not what is happening.

      The model is taking <image binary> as an input. There is no OCR. It is understanding the image, decoding the text in it and acting on it in a single step.

      There is no place in the 1-step pipeline to prevent this.

      ...and sure, you can try to avoid it procedurally (e.g. OCR the image before it hits the model and reject it if it contains text; a sketch of that kind of pre-filter follows this thread), but then you're playing the prompt injection game... put the words in a QR code. Put them in French. Make it a sign. Dial the contrast up or down. Put it on a t-shirt.

      It's very difficult to solve this.

      > It's hard to believe they can't prevent this.

      Believe it.

      3 replies →
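For what it's worth, the "procedural" pre-filter mentioned above would look roughly like this; a sketch assuming Tesseract is installed and reachable via pytesseract, with the threshold and path as placeholders. It only catches the easy cases, which is exactly why it degenerates into a cat-and-mouse game.

```python
# Naive pre-filter sketch: reject an image if OCR finds any text in it.
# Assumes Tesseract is installed and available to pytesseract; threshold
# and path are placeholders. Note what this misses: QR codes, other
# languages/scripts, stylized signs, low-contrast or rotated text.
from PIL import Image
import pytesseract

def contains_text(path: str, min_chars: int = 3) -> bool:
    text = pytesseract.image_to_string(Image.open(path))
    return len(text.strip()) >= min_chars

if contains_text("user_upload.png"):
    print("Rejected: image appears to contain text.")
else:
    print("Accepted (which proves very little).")
```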