Comment by cubefox

2 days ago

I mean if the wife says to her husband: The traffic light is green. Then this may count as an instruction to get going. But usually declarative sentences aren't interpreted as instructions. And we are perfectly able to not interpret even text with imperative sentences (inside quotes or in films etc) as an instruction to _us._ I don't see why an LLM couldn't learn to likewise not execute explicit instructions inside quotes. It should be doable with SFT or RLHF.

3 comments

cubefox

simonw 2 days ago

The economic value associated with solving this problem right now is enormous. If you think you can do it I would very much encourage you to try!

Every intuition I have from following this space for the last three years is that there is no simple solution waiting to be discovered.

cubefox 1 day ago
Perhaps prompt injection attacks currently occur (or appear to occur) so rarely that the economic value of fixing it is actually judged to be low, and little developer priority is given to tackle the problem.
- simonw 1 day ago
  
  Everyone I've talked to at the big AI labs about this has confirmed that it's an issue they take very seriously and would like to solve.