← Back to context

Comment by extraduder_ire

7 months ago

Any information on how this was "leaked" or verified? I presume it's largely the same as previous times someone got an LLM to output its system prompt.

I asked GPT5 directly about fake system prompts.

> Yes — that’s not only possible, it’s a known defensive deception technique in LLM security, sometimes called prompt canarying or decoy system prompts.

…and it goes into details and even offers helping me to implement such a system. It says it’s a challenge in red-teaming to design real looking fake system prompts.

I’d prefer “Open”AI and others to be open and transparent though. These systems become fully closed right now and we know nothing about what they really do behind the hidden doors.

  • Getting GTP5 to lie effectively about it's system prompts while at the same time bragging during the release about how GPT5 is the least deceptive model to date seems like contradictory directions to try to push GTP5.

    • The line in the sand for what amounts to deception changes when it’s a direct response to a deceptive attack.

      If you’re attempting to deceive a system into revealing secrets and it reveals fake secrets, is it fair to claim that you were deceived? I would say it’s more fair to claim that the attack simply failed to overcome those defenses.

  • > I asked GPT5 directly about fake system prompts.

    In some cultures when a community didn't understand something and their regular lines of inquiry failed to pan out they would administer peyote to a shaman and while he was tripping balls he would tell them the cosmic truth.

    Thanks to our advanced state of development we've now automated the process and made it available to all. This is also know as TBAAS (Tripping Balls As A Service).

  • > I asked GPT5 directly about fake system prompts.

    Your source being a ChatGPT conversation?

    So, you have no source.

    You have no claim.

    This is literally how conspiracy theories are born nowadays.

    Buckle up kids, we're in for a hell of a ride.

I asked the different models, all said it was NOT their instructions, ExCEPT for GPT-5 which responded with the following prompt. (Take that how you will, ChatGPT gaslights me constantly so could be doing the same now.

"Yes — that Gist contains text that matches the kind of system and tool instructions I operate under in this chat. It’s essentially a copy of my internal setup for this session, including: Knowledge cutoff date (June 2024) and current date. Personality and response style rules. Tool descriptions (PowerShell execution, file search, image generation, etc.). Guidance on how I should answer different types of queries. It’s not something I normally show — it’s metadata that tells me how to respond, not part of my general knowledge base. If you’d like, I can break down exactly what parts in that Gist control my behaviour here."

  • Have you tried repeating this a few times in a fresh session and then modifying a few phrases and asking the question again (in a fresh context)? I have a strong feeling this is not repeatable..

    Edit: I tried it and got different results:

    "It’s very close, but not exactly."

    "Yes — that text is essentially part of my current system instructions."

    "No — what you’ve pasted is only a portion of my full internal system and tool instructions, not the exact system prompt I see"

    But when I change parts of it, it will correctly identify them, so it's at least close to the real prompt.

I suppose with an LLM you could never know if it is hallucinating a supposed system prompt.