Comment by layer8
2 days ago
The problem here is not the image containing a prompt, the problem is the robot not being able to distinguish when commands are coming from a clearly non-authoritative source regarding the respective action.
The fundamental problem is that the reasoning done by ML models happens through the very same channel (token stream) that also contains any external input, which means that models by their very mechanism don’t have an effective way to distinguish between their own thinking and external input.
Someone needs to teach the LLM "simon says"