Comment by satvikpendem
7 days ago
There is no way to get rid of a prompt injection attack. There are always ways to convince the AI to do something else besides flagging a post even if that's its initial instruction.
The raw text of the person's message will be posted to the forum, so if it's a prompt injection it will be obvious to the community, who can flag it for human review and get the account banned.
Sure, but that only works if human moderators see it before the AI does, in which case, why have an AI at all? I presume that in this solution the AI runs all the time and sees messages the instant they're sent, so it will always be exposed to a prompt injection before any human sees the message.
To moderate the majority of the community that will not be attempting prompt injections.
What meaningful vulnerabilities are there if the post can only be accepted/rejected/flaggedForHumanReview?
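To make the restricted-output idea concrete, here is a minimal sketch of how a wrapper might force the model's reply into one of those three outcomes. The names `Verdict` and `parse_verdict` are hypothetical, and the assumption is that the model is asked to reply with exactly one of the three strings; anything else is treated as suspicious and sent to a human.

```python
from enum import Enum


class Verdict(Enum):
    # The only three outcomes the moderation pipeline is allowed to act on.
    ACCEPT = "accept"
    REJECT = "reject"
    FLAG_FOR_HUMAN_REVIEW = "flagForHumanReview"


def parse_verdict(model_output: str) -> Verdict:
    """Map raw model output onto the closed set of verdicts.

    Any reply that is not exactly one of the expected strings
    (after trimming whitespace) falls back to human review.
    """
    cleaned = model_output.strip()
    for verdict in Verdict:
        if cleaned == verdict.value:
            return verdict
    return Verdict.FLAG_FOR_HUMAN_REVIEW
```

This bounds what the model can cause downstream, since free-form output never reaches the forum software. It does not stop an injected post from talking the model into answering "accept", which is the remaining exposure the earlier comments are arguing about.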