Comment by dmix
3 days ago
The system prompt for Grok on Twitter is open source AFAIK.
For example, the change that caused "mechahitler" was relatively minor and was live for about a day before being publicly reverted.
https://github.com/xai-org/grok-prompts/commit/c5de4a14feb50...
That doesn't mean there are no private injections, which are not uncommon. For example, claude.ai system prompts are public, but Claude also has hidden dynamic prompt injections and a ton of other semi-black-box machinery surrounding the model.
Sorry, but can you point me to what part of the system prompt here would/could be responsible for causing MechaHitler?
I have yet to see anything in the prompt they claim to have been using that would lead to such output from models by Google, OpenAI, or Anthropic.