Comment by roywiggins

2 years ago

> OpenAI could presumably add a "did the safety net kick in?" boolean to API responses, and, also presumably, they don't want to do that because it would make it easier to systematically bypass.

Is a safety net kicking in or is the model just trained to respond with a refusal to certain prompts? I am fairly sure it's usually the latter, and in that case even OpenAI can't be sure a particular response is a refusal or not.