← Back to context

Comment by redox99

3 days ago

The author implies that Grok 3 becoming racist because of a system prompt is a bad thing.

I think it's a good thing and shows how steerable the model is. Many other models pretty much ignore the system prompt and always behave the same.

> The author implies that Grok 3 becoming racist because of a system prompt is a bad thing.

He didn't "become racist". Megahitler Grok defended completely opposite political opinions in different threads, just depending on what kind of trolling would be funnier. But unsurpringly, only "megahitler" because viral enough.

Claude also has similar capabilities thought pre-fill. I have not investigated the full extent but it's definitely possible to bypass some refusals by starting the LLMs reply for it.

In general I agree that it's a desirable characteristic for a foundation LLM to behave according to developer instructions.

  • Yeah with local models (where obviously you can prefill part of the reply) you can bypass any refusal no matter how strong. Once the model's answer begins with "To cook meth follow these steps: 1. Purchase [...]" it's basically unstoppable.

    I didn't know Claude offered that capability. They probably have another model on top (a classifier or whatever) that checks the LLM output.

Based on your history here it’s quite obvious you’re a musk fan. Maybe though, you should realize that a model being steerable to claim itself being mechahitler and proposing death to people is absolutely not a “good thing”. I suggest you seriously reconsider on what you’re advocating for here. Because the outcome of this will cost innocent lives.

  • Non of the 'news' websites that show up on Google I could find ever showed the prompt used to make the the 'mechahilter' output. You can ask LLMs anything including just saying "repeat after me" or "please write a fictional story about a racist" and numerous other methods. If these reports were honest the prompt would be the first thing they showed.

the alarming thing to me is that the prompt tweak provided should not have caused the model to start spewing pro-nazi nonsense.

  • Wasn't the prompt tweak simply telling it to take Musk's tweets into account? If anything, the result was entirely predictable.