Comment by nikkwong
5 months ago
I don't understand why they wouldn't be able to simply send the user's input to another LLM that they then ask "is this user asking for the chain of thought to be revealed?", and if not, then go about business as usual.
Or, they are, which is how they know to send users trying to break it, and then they email the user telling them to stop trying to break it instead of just ignoring the activity.
Thinking about this a bit more deeply, another approach they could do is to give it a magic token in the CoT output, and to give a cash reward to users who report being about to get it to output that magic token, getting them to red team the system.