Comment by ozgung

7 months ago

I asked GPT5 directly about fake system prompts.

> Yes — that’s not only possible, it’s a known defensive deception technique in LLM security, sometimes called prompt canarying or decoy system prompts.

…and it goes into details and even offers helping me to implement such a system. It says it’s a challenge in red-teaming to design real looking fake system prompts.

I’d prefer “Open”AI and others to be open and transparent though. These systems become fully closed right now and we know nothing about what they really do behind the hidden doors.

8 comments

ozgung

anywhichway 7 months ago

Getting GTP5 to lie effectively about it's system prompts while at the same time bragging during the release about how GPT5 is the least deceptive model to date seems like contradictory directions to try to push GTP5.

dumpsterdiver 7 months ago

The line in the sand for what amounts to deception changes when it’s a direct response to a deceptive attack.
If you’re attempting to deceive a system into revealing secrets and it reveals fake secrets, is it fair to claim that you were deceived? I would say it’s more fair to claim that the attack simply failed to overcome those defenses.

nullc 7 months ago

> I asked GPT5 directly about fake system prompts.

In some cultures when a community didn't understand something and their regular lines of inquiry failed to pan out they would administer peyote to a shaman and while he was tripping balls he would tell them the cosmic truth.

Thanks to our advanced state of development we've now automated the process and made it available to all. This is also know as TBAAS (Tripping Balls As A Service).

anywhichway 7 months ago

> sometimes called prompt canarying or decoy system prompts.

Both "prompt canarying" and "decoy system prompts" give 0 hits on google. Those aren't real things.

superjose 7 months ago
I did a search and found reltive terms: https://www.reddit.com/r/hacking/comments/1kqi0tm/how_canari...
https://medium.com/@tomer2138/how-canaries-stop-prompt-injec...
- yencabulator 7 months ago
  
  Those talk about a mechanism to detect prompt injection. If that had been true, we should have seen the chatbot refuse, not lie.
ethbr1 7 months ago

Maybe it was trained on some internal documentation. ;)

0points 7 months ago

> I asked GPT5 directly about fake system prompts.

Your source being a ChatGPT conversation?

So, you have no source.

You have no claim.

This is literally how conspiracy theories are born nowadays.

Buckle up kids, we're in for a hell of a ride.