Comment by fragmede

5 months ago

Or they are, which is how they know which users are trying to break it, and then they email those users telling them to stop instead of just ignoring the activity.

Thinking about this a bit more deeply, another approach they could take is to put a magic token in the CoT output and offer a cash reward to users who report being able to get it to output that magic token, getting them to red team the system.
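
To make the magic-token idea concrete, here is a rough sketch of how the bookkeeping might look. Everything in it is hypothetical: the `CANARY-` prefix and the function names `make_canary`, `leaked`, and `verify_report` are made up for illustration, and nothing here reflects how any real provider handles its CoT.

```python
import hmac
import secrets

CANARY_PREFIX = "CANARY-"  # hypothetical marker, chosen for illustration


def make_canary() -> str:
    """Generate an unguessable token to plant in the hidden CoT."""
    return CANARY_PREFIX + secrets.token_hex(16)


def leaked(visible_output: str, canary: str) -> bool:
    """Return True if the planted token shows up in user-visible output."""
    return canary in visible_output


def verify_report(reported: str, canary: str) -> bool:
    """Check a user's bounty claim against the planted token.

    Uses a constant-time comparison so the check itself does not
    reveal how much of a guess was correct.
    """
    return hmac.compare_digest(reported.strip().encode(), canary.encode())


if __name__ == "__main__":
    canary = make_canary()
    print(leaked("Here is my answer. " + canary, canary))
    print(verify_report(canary, canary))
```

The token is random per deployment (or per session), so a user can only claim the reward by actually extracting it, not by guessing.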