Comment by bscphil
7 months ago
I think that's a valid question, and I ask it every time someone reports "this LLM said X about itself", but there are potential ways to verify it: for example, upthread, someone pointed out that the part about copyrighted material is badly worded. It says something like "don't print song lyrics or other copyright material", thereby implying that song lyrics are always copyrighted. Someone tested this, and sure enough, GPT-5 refused to print the lyrics to the Star Spangled Banner, claiming they were copyrighted.
I think that's pretty good evidence, and it's certainly not impossible for an LLM to print its system prompt, since it's in the context history of the conversation (as I understand it; correct me if that's wrong).
I’m skeptical. The prompt also contains a bit about not asking “if you want I can” and similar, but for me it does that constantly.
Is that evidence that they’re trying to stop a common behavior, or evidence that the system prompt was invented in that case?
Edit: I asked it whether its system prompt discouraged or encouraged the behavior, and it returned some of that exact same text, including the examples.
It ended with:
> If you want, I can— …okay, I’ll stop before I violate my own rules.