
Comment by zamadatix

5 months ago

The model will often recognise that a request is part of whatever ${naughty_list} it was trained on and generate a refusal response. Banning seems aimed more at preventing people from working around this by throwing massive volume at it to see what eventually slips through, since requiring a new payment account integration puts a "significantly better than doing nothing" damper on that type of exploitation. I.e. their goal isn't to get abuse to 0 or shut down the service, it's to mitigate the scale of impact from inevitable exploits.
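For illustration only, a minimal sketch of the kind of mitigation being described, assuming flags are tallied per payment identity rather than per API key; the function names and threshold are made up, not OpenAI's actual policy:

```python
from collections import defaultdict

FLAG_THRESHOLD = 3  # assumed cutoff before a ban; the real policy is unknown

flag_counts: dict[str, int] = defaultdict(int)
banned_accounts: set[str] = set()

def record_flagged_request(payment_account_id: str) -> None:
    """Tally a refused/flagged request against the paying account."""
    flag_counts[payment_account_id] += 1
    if flag_counts[payment_account_id] >= FLAG_THRESHOLD:
        banned_accounts.add(payment_account_id)

def is_allowed(payment_account_id: str) -> bool:
    # A fresh API key doesn't reset this; only a new payment account does,
    # which is exactly the friction the ban is meant to create.
    return payment_account_id not in banned_accounts
```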

Of course, the deeply specific answers to any of these questions are going to be unanswerable by anyone outside OpenAI.

I think once a small corpus of examples of CoT gets around, people will be able to reverse-engineer it.

  • They will, but the fixes also (seem to?) get trained into each model update (and there are many minor versions of each major release). I wonder how they approach API model pinning, though; perhaps the safety check is separated from the main parts of the model and can be layered in (a sketch of what that layering might look like is below).

    The other part of the massive-volume issue is that it's not just a "what clever prompts can skirt around detection sometimes" problem; it's a "detection, like the rest of it, doesn't seem to work for 100% of outputs, so throwing the same 'please do it anyways' at it enough times can get you by if you're dedicated enough" type of problem (the arithmetic sketch below makes this concrete).
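As a sketch of the layering speculated about above (purely hypothetical; every name, function, and threshold here is assumed, not OpenAI's known architecture), the generation model could stay pinned to a fixed version while a separately updated classifier screens inputs and outputs:

```python
REFUSAL = "I can't help with that."
RISK_CUTOFF = 0.8  # assumed threshold; real values are unknown

def generate(prompt: str, model: str = "some-pinned-model-version") -> str:
    # Stand-in for a call to the frozen, version-pinned generation model.
    return f"[{model} output for: {prompt}]"

def moderate(text: str) -> float:
    # Stand-in for a separately trained safety classifier; in this layered
    # design it can be retrained and updated without touching the pinned model.
    return 0.9 if "naughty" in text.lower() else 0.1

def safe_completion(prompt: str) -> str:
    if moderate(prompt) > RISK_CUTOFF:
        return REFUSAL
    output = generate(prompt)
    if moderate(output) > RISK_CUTOFF:
        return REFUSAL
    return output
```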
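The arithmetic behind the volume problem, with made-up numbers: if each attempt independently slips past detection with probability p, the chance of at least one success in n attempts is 1 − (1 − p)^n, which climbs quickly even for small p.

```python
def p_at_least_one_slip(p_slip: float, attempts: int) -> float:
    # Chance that at least one of `attempts` independent tries gets past
    # a detector with per-attempt slip probability `p_slip`.
    return 1 - (1 - p_slip) ** attempts

# Illustrative only: even a 1% per-attempt slip rate gives ~63% odds
# of getting through at least once in 100 tries.
print(p_at_least_one_slip(0.01, 100))  # ≈ 0.634
```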