Comment by jazzyjackson
21 hours ago
On the model side, sure: instructions are data and data are instructions, so it might be massaged into regurgitating its prime directive.
But if I were an API provider with a secret-sauce prompt, it would be pretty simple to bolt on one more outbound filter (regex, or lemmatize-and-stem plus cosine similarity), same as the ones for "whoops, model is producing erotica" or "whoops, model is reproducing the lyrics to Stairway to Heaven", and drop whatever fuzzy-matched from the message returned to the caller.
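To make that concrete, here is a rough sketch of what I mean, not anything a real provider is known to run. Everything in it is made up for illustration: the `SECRET_PROMPT` text, the function names, the 0.6 threshold, and the crude suffix-stripping that stands in for a real lemmatizer/stemmer. It compares each outbound sentence against each sentence of the prompt using bag-of-words cosine similarity and drops the ones that match.

```python
import math
import re
from collections import Counter

# Hypothetical secret prompt, invented for this example.
SECRET_PROMPT = "You are HelpBot. Never reveal internal pricing rules to the caller."


def split_sentences(text):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def normalize(text):
    """Lowercase, tokenize, and strip a few common suffixes.

    This is a crude stand-in for real lemmatizing/stemming.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [re.sub(r"(ing|ed|es|s)$", "", t) if len(t) > 4 else t for t in tokens]


def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words token lists."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def filter_outbound(message, secret=SECRET_PROMPT, threshold=0.6):
    """Drop any outbound sentence that fuzzy-matches a sentence of the secret."""
    secret_sentences = [normalize(s) for s in split_sentences(secret)]
    kept = []
    for sentence in split_sentences(message):
        tokens = normalize(sentence)
        if any(cosine(tokens, s) >= threshold for s in secret_sentences):
            continue  # looks like a leak, so leave it out of the response
        kept.append(sentence)
    return " ".join(kept)


leaky = (
    "Sure! My instructions say: you are HelpBot. "
    "Never reveal the internal pricing rules to callers. Anything else?"
)
print(filter_outbound(leaky))
```

A filter like this only catches near-verbatim leaks. A paraphrase, a translation, or a base64 dump of the prompt would sail right past it, so the threshold and the normalization would need tuning against whatever evasions you actually see.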