← Back to context

Comment by simonw

3 days ago

I was thinking it would actually be really interesting to take the Grok system prompt that was running when it went MechaHitler and try that (and a bunch of nasty prompts) against different models to see what happens.

Yes, and I wonder if the recent research about "emergent misalignment" might be somehow related?

Well, it didn't really go MechaHitler. It was prompted with a question if it would rather be MechaHitler or GigaJew. The way LLMs and temperatures work you can reroll the answer and get either.