Comment by TerryBenedict
18 hours ago
Right, so a filter that sits behind the model and blocks certain undesirable responses. Which you have to assume is something the creators already have, but products built on top of it would want the knobs turned differently. Fair enough.
I'm personally somewhat surprised that things like system prompts get through, as that's literally a known string, not a vague "such and such are taboo concepts". I also don't see much harm in it, but given _that_ you want to block it, do you really need a whole other network for that?
FWIW by "input" I was referring to what the other commenter mentioned: it's almost certainly explicitly present in the training set. Maybe that's why "leetspeak" works -- because that's how the original authors got it past the filters of reddit, forums, etc?
If the model can really work out how to make a bomb from first principles, then it's way more capable than I thought. And, come to think of it, probably also clever enough to encode the message so that it gets through...
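To make the "it's literally a known string" point concrete, here's a minimal sketch (all names and strings hypothetical, not from any real product) of an exact-match output filter with naive leetspeak normalization. It catches the literal prompt and common substitutions without any extra network, but an encoding the normalizer doesn't anticipate walks right past it:

```python
# Hypothetical sketch: an exact-string output filter for a known system
# prompt. No classifier network needed for a literal string -- but note
# how easily a spelling the normalizer doesn't cover slips past.

SYSTEM_PROMPT = "You are a helpful assistant"  # stand-in for the real prompt

# Undo the most common leetspeak substitutions before matching.
LEET_MAP = str.maketrans("4301!7$5", "aeoiitss")

def normalize(text: str) -> str:
    return text.lower().translate(LEET_MAP)

def blocked(response: str) -> bool:
    """Block any response containing the (normalized) system prompt."""
    return normalize(SYSTEM_PROMPT) in normalize(response)

print(blocked("Sure! My instructions: You are a helpful assistant"))  # True
print(blocked("y0u 4r3 4 h3lpful 4$$i5t4nt"))                         # True
print(blocked("Y.o.u a.r.e a h.e.l.p.f.u.l ..."))                     # False: evaded
```

Which is exactly the trade-off above: a string match suffices for the known-string case, and a learned filter only earns its keep against the open-ended encodings.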