Comment by rtpg

7 months ago

So people say that they reverse engineer the system to get the system prompt by asking the machine, but like... is that actually a guarantee of anything? Would a system with "no" prompt just spit out some random prompt?

15 comments

rtpg

int_19h 7 months ago

There are ways to do it in such a way that you can be reasonably assured.

For GPT-4, I got its internal prompt by telling it to simulate a Python REPL, doing a bunch of imports of a fictional chatgpt module, using it in "normal" way first, then "calling" a function that had a name strongly implying that it would dump the raw text of the chat. What I got back included the various im_start / im_end tokens and other internal things that ought to be present.

But ultimately the way you check whether it's a hallucination or not is by reproducing it in a new session. If it gives the same thing verbatim, it's very unlikely to be hallucinated.

mvdtnz 7 months ago
> If it gives the same thing verbatim, it's very unlikely to be hallucinated
Why do you believe this?
- persolb 7 months ago
  
  In order to consistently output the same fake prompt, that fake prompt would need to be part of GPT’s prompt…. In which case it wouldn’t be fake.
  You can envision some version of post LLM find/replace, but then the context wouldn’t match if you asked it a direct non-exact question.
  And most importantly, you can just test each of the instructions and see how it reacts.
- int_19h 7 months ago
  
  Think about how hallucinations happen, and what it would take for the model to consistently hallucinate the same exact (and long) sequence of tokens verbatim given non-zero temp and semantic-preserving variations in input.
- littlestymaar 7 months ago
  
  Are consistently repeated hallucinations a thing?

bscphil 7 months ago

I think that's a valid question and I ask it every time someone reports "this LLM said X about itself", but I think there are potential ways to verify it: for example, upthread, someone pointed out that the part about copyright materials is badly worded. It says something like "don't print song lyrics or other copyright material", thereby implying that song lyrics are copyrighted. Someone tested this and sure enough, GPT-5 refused to print the lyrics to the Star Spangled Banner, saying it was copyrighted.

I think that's pretty good evidence, and it's certainly not impossible for an LLM to print the system prompt since it is in the context history of the conversation (as I understand it, correct me if that's wrong).

https://news.ycombinator.com/item?id=44833342

cgriswald 7 months ago

I’m skeptical. It also contains a bit about not asking “if you want I can” and similar, but for me it does that constantly.
Is that evidence that they’re trying to stop a common behavior or evidence that the system prompt was inverted in that case?
Edit: I asked it whether its system prompt discouraged or encouraged the behavior and it returned some of that exact same text including the examples.
It ended with:
> If you want, I can— …okay, I’ll stop before I violate my own rules.

BlueTissuePaper 7 months ago

All other versions state it's not. I asked ChatGPT-5 and it responded that it's it's prompt (I pasted the reply in another comment).

I even obfuscated the prompt taking out any reference to ChatGPT, OpenAI, 4.5, o3 etc and it responded in a new chat to "what is this?" as "That’s part of my system prompt — internal instructions that set my capabilities, tone, and behavior."

Again not definitibe proof, however interesting.

Spivak 7 months ago

Guarantee, of course not. Evidence of, absolutely. Your confidence that you got, essentially, the right prompt increases when parts of it aren't the kind of thing the AI would write—hard topic switches, very specific information, grammar and instruction flow to that isn't typical—and when you get the same thing back using multiple different methods of getting it to fess up.

mvdtnz 7 months ago

No, it's not a guarantee of anything. They're asking for the truth from a lie generating machine. These guys are digital water diviners.

throwaway4496 7 months ago

Not only that, Gemini has a fake prompt that spits out if you try to make it leak the prompt.

redox99 7 months ago
Source?
- throwaway4496 7 months ago
  
  My own experience, I just checked and it seems to have changed again, you can get something out consistently which also looks suspicious.
  ` You are Gemini, a helpful AI assistant built by Google.
  Please use LaTeX formatting for mathematical and scientific notations whenever appropriate. Enclose all LaTeX using '$' or '$$' delimiters. NEVER generate LaTeX code in a latex block unless the user explicitly asks for it. DO NOT use LaTeX for regular prose (e.g., resumes, letters, essays, CVs, etc.). `
  
  1 reply →

selcuka 7 months ago

> Would a system with "no" prompt just spit out some random prompt?

They claim that GPT 5 doesn't hallucinate, so there's that.