
Comment by Retr0id

2 years ago

Sometimes it "apologizes" rather than saying "sorry". You could build a fairly solid heuristic, but I'm not sure you can catch every possible phrasing.
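For what it's worth, a rough sketch of what such a heuristic might look like. The phrase list, the `looks_like_refusal` name, and the 200-character window are all my own assumptions, not anything OpenAI documents, and the last example shows the kind of phrasing it misses.

```python
import re

# Illustrative phrase list only; real refusals use many more wordings.
REFUSAL_PATTERNS = [
    r"\bsorry\b",
    r"\bapologi[sz]e\b",
    r"\bapologies\b",
    r"\bas an ai language model\b",
    r"\bi (?:cannot|can't|am unable to)\b",
    r"\bi'm unable to\b",
]
_REFUSAL_RE = re.compile("|".join(REFUSAL_PATTERNS), re.IGNORECASE)


def looks_like_refusal(text: str, window: int = 200) -> bool:
    """Return True if the opening of `text` contains a known refusal phrase.

    Only the first `window` characters are checked, on the assumption that
    refusals tend to announce themselves early in the response.
    """
    # Normalize curly apostrophes so "can’t" and "can't" are treated alike.
    head = text[:window].replace("\u2019", "'")
    return _REFUSAL_RE.search(head) is not None


if __name__ == "__main__":
    print(looks_like_refusal("I'm sorry, but I can't help with that."))
    print(looks_like_refusal("This kettle boils a litre of water in three minutes."))
    # A refusal with no listed phrase slips through, which is the weak spot.
    print(looks_like_refusal("Unfortunately that request is something I won't do."))
```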

OpenAI could presumably add a "did the safety net kick in?" boolean to API responses, and, also presumably, they don't want to do that because it would make it easier to systematically bypass.

> OpenAI could presumably add a "did the safety net kick in?" boolean to API responses, and, also presumably, they don't want to do that because it would make it easier to systematically bypass.

Is a safety net kicking in, or is the model just trained to respond with a refusal to certain prompts? I am fairly sure it's usually the latter, and in that case even OpenAI can't be sure whether a particular response is a refusal.

Just feed the text to a new ChatGPT conversation and ask it whether the text is an apology or a product description.

Or do traditional NLP, but letting ChatGPT classify your text is less effort to set up.
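A minimal sketch of the second-request idea, assuming you already have some function that sends a prompt to a fresh conversation and returns the reply text. Here that is the hypothetical `ask_model` parameter; the prompt wording and the APOLOGY/PRODUCT labels are my own choices, and no real API call is shown.

```python
from typing import Callable

# Hypothetical prompt; the exact wording and labels are assumptions.
PROMPT_TEMPLATE = (
    "Classify the text between the <text> tags. "
    "Answer with exactly one word: APOLOGY if it is an apology or refusal, "
    "PRODUCT if it is a product description.\n\n"
    "<text>\n{text}\n</text>"
)


def is_apology(text: str, ask_model: Callable[[str], str]) -> bool:
    """Ask a separate model conversation whether `text` is an apology.

    `ask_model` is a caller-supplied function (hypothetical here) that takes
    a prompt string and returns the model's reply as a string.
    """
    reply = ask_model(PROMPT_TEMPLATE.format(text=text))
    label = reply.strip().upper()
    if label.startswith("APOLOGY"):
        return True
    if label.startswith("PRODUCT"):
        return False
    # The classifier itself can go off-script, so fail loudly instead of guessing.
    raise ValueError(f"unexpected classifier reply: {reply!r}")
```

Passing the model call in as a parameter keeps the sketch independent of any particular client library and makes it easy to test with a stub. Note that the classifier is still a model, so it can misjudge or ignore the one-word instruction, hence the explicit error path.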

Why not have a separate chat request to apology-check the responses?

Not my original idea; there was a link from HN where the dev did just that.