Comment by j_maffe
5 months ago
I think once a small corpus of examples of CoT gets around, people will be able to reverse-engineer it.
5 months ago
I think once a small corpus of examples of CoT gets around, people will be able to reverse-engineer it.
They will, but they also (seem to?) get trained into each model update (and there are many minor versions of each major release). I wonder how they approach API model pinning, though; perhaps the safety check is separate from the main parts of the model and can be layered in.
The other part of the massive volume issue is that it's not just a "what clever prompts can skirt around detection sometimes" problem; it's a "detection, like the rest of it, doesn't seem to work for 100% of outputs, so throwing the same 'please do it anyways' at it enough times can get you by if you're dedicated enough" type of problem.