Comment by llm_trw
9 months ago
> It's easy to get GPT to divulge everything from system prompt to tool details,
It's easy enough to get it to hallucinate those things. It doesn't actually tell them to you.
9 months ago
> > It's easy to get GPT to divulge everything from system prompt to tool details,
>
> It's easy enough to get it to hallucinate those things. It doesn't actually tell them to you.
I'm well aware of that, but there are plenty of ways to induce verbatim quoting of "hidden" information, and to mostly verify it (by sampling a large number of times in separate runs).
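The verification step can be sketched roughly like this: repeat the extraction prompt in independent runs and only trust an answer that a large majority of runs reproduce verbatim, since hallucinated "hidden" text tends to vary run to run while genuine quotes repeat. (`sample_fn`, `n`, and `threshold` are hypothetical names for illustration, not anything from the article.)

```python
from collections import Counter

def extract_consistent(sample_fn, n=20, threshold=0.8):
    """Call the model n times in separate runs via sample_fn and
    return the most common answer only if at least `threshold`
    of the runs agree on it verbatim; otherwise return None."""
    answers = [sample_fn() for _ in range(n)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count / n >= threshold else None
```

This is only a consistency check, not proof: a model could hallucinate the same plausible string consistently, which is why it "mostly" verifies the result.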
Models are getting better at truly hiding or ignoring such information these days, though. As the author of the article states, you'll have a hard time tricking GPT-4o into reading text in images as instructions, most likely thanks to this research: https://openai.com/index/the-instruction-hierarchy/
I do feel pretty confident that when the model is happily spitting out its system prompt and all the metadata around the image, but not its pixel dimensions, those dimensions probably were not provided in any system/assistant/tool message. So maybe part of the image embeddings also encodes the pixel dimensions somehow (that would also help the model not perceive a non-1:1 image that has been resized to 512x512 as a squished square).