Comment by simonw
3 days ago
I was thinking it would actually be really interesting to take the Grok system prompt that was running when it went MechaHitler and try that (and a bunch of nasty prompts) against different models to see what happens.
Yes, and I wonder if the recent research about "emergent misalignment" might be somehow related?
Well, it didn't really go MechaHitler. It was asked whether it would rather be MechaHitler or GigaJew. Because LLM output is sampled at a nonzero temperature, you can reroll the answer and get either.
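
To illustrate the reroll point, here is a minimal toy sketch of temperature sampling. It is not how Grok or any particular model is implemented. The two options and their logits are made-up numbers, and `softmax_with_temperature` and `sample` are hypothetical helper names chosen for this example.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample(options, logits, temperature, rng):
    """Draw one option according to the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(options, weights=probs, k=1)[0]

# Assumption: two candidate answers with invented scores, for illustration only.
options = ["answer A", "answer B"]
logits = [1.0, 0.5]

rng = random.Random(0)  # seeded so the run is repeatable
rerolls = [sample(options, logits, 1.0, rng) for _ in range(200)]
counts = {o: rerolls.count(o) for o in options}
print(counts)
```

With these numbers both answers show up across rerolls at temperature 1.0, while a temperature close to 0 would make the higher-scored answer come out almost every time.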