Comment by r_lee
4 hours ago
yeah, I'm confused as well, why would the models hold any memory about red teaming attempts etc? Or how the training was conducted?
I'm really curious as to what the point of this paper is..
I'm genuinely ignorant of how those red-teaming attempts are incorporated into training, but I'd guess that this kind of dialogue is fed in as something like normal training data? Which is interesting to think about: it might not even be red-team dialogue from the model under training, but it could still be useful as an example or counter-example of what abusive attempts look like and how to handle them.
Are we sure there isn't some company out there crazy enough to feed all its incoming prompts back into model training later?