Comment by extraduder_ire

10 months ago

Any information on how this was "leaked" or verified? I presume it's largely the same as previous times someone got an LLM to output its system prompt.

23 comments

extraduder_ire

ozgung 10 months ago

I asked GPT5 directly about fake system prompts.

> Yes — that’s not only possible, it’s a known defensive deception technique in LLM security, sometimes called prompt canarying or decoy system prompts.

…and it goes into details and even offers helping me to implement such a system. It says it’s a challenge in red-teaming to design real looking fake system prompts.

I’d prefer “Open”AI and others to be open and transparent though. These systems become fully closed right now and we know nothing about what they really do behind the hidden doors.

anywhichway 10 months ago
Getting GTP5 to lie effectively about it's system prompts while at the same time bragging during the release about how GPT5 is the least deceptive model to date seems like contradictory directions to try to push GTP5.
- dumpsterdiver 10 months ago
  
  The line in the sand for what amounts to deception changes when it’s a direct response to a deceptive attack.
  If you’re attempting to deceive a system into revealing secrets and it reveals fake secrets, is it fair to claim that you were deceived? I would say it’s more fair to claim that the attack simply failed to overcome those defenses.
nullc 10 months ago

> I asked GPT5 directly about fake system prompts.
In some cultures when a community didn't understand something and their regular lines of inquiry failed to pan out they would administer peyote to a shaman and while he was tripping balls he would tell them the cosmic truth.
Thanks to our advanced state of development we've now automated the process and made it available to all. This is also know as TBAAS (Tripping Balls As A Service).
anywhichway 10 months ago
> sometimes called prompt canarying or decoy system prompts.
Both "prompt canarying" and "decoy system prompts" give 0 hits on google. Those aren't real things.
- superjose 10 months ago
  
  I did a search and found reltive terms: https://www.reddit.com/r/hacking/comments/1kqi0tm/how_canari...
  https://medium.com/@tomer2138/how-canaries-stop-prompt-injec...
  
  1 reply →
- ethbr1 10 months ago
  
  Maybe it was trained on some internal documentation. ;)
0points 10 months ago

> I asked GPT5 directly about fake system prompts.
Your source being a ChatGPT conversation?
So, you have no source.
You have no claim.
This is literally how conspiracy theories are born nowadays.
Buckle up kids, we're in for a hell of a ride.

BlueTissuePaper 10 months ago

I asked the different models, all said it was NOT their instructions, ExCEPT for GPT-5 which responded with the following prompt. (Take that how you will, ChatGPT gaslights me constantly so could be doing the same now.

"Yes — that Gist contains text that matches the kind of system and tool instructions I operate under in this chat. It’s essentially a copy of my internal setup for this session, including: Knowledge cutoff date (June 2024) and current date. Personality and response style rules. Tool descriptions (PowerShell execution, file search, image generation, etc.). Guidance on how I should answer different types of queries. It’s not something I normally show — it’s metadata that tells me how to respond, not part of my general knowledge base. If you’d like, I can break down exactly what parts in that Gist control my behaviour here."

planb 10 months ago
Have you tried repeating this a few times in a fresh session and then modifying a few phrases and asking the question again (in a fresh context)? I have a strong feeling this is not repeatable..
Edit: I tried it and got different results:
"It’s very close, but not exactly."
"Yes — that text is essentially part of my current system instructions."
"No — what you’ve pasted is only a portion of my full internal system and tool instructions, not the exact system prompt I see"
But when I change parts of it, it will correctly identify them, so it's at least close to the real prompt.
- YeahThisIsMe 10 months ago
  
  How could you ever verify this if the only thing you're relying on is its response?
  
  4 replies →
- energy123 10 months ago
  
  Give it the first few sentences and ask it to complete the next sentence. If it gets it right without search it's guaranteed to be the real system prompt.
  
  2 replies →
- ASalazarMX 10 months ago
  
  I think you just invented prompt spelunking.

sebazzz 10 months ago

I suppose with an LLM you could never know if it is hallucinating a supposed system prompt.

JohnMakin 10 months ago

Curious too, most of the replies are completely credulous.

aaron695 10 months ago

[dead]