Comment by redox99

3 days ago

The author implies that Grok 3 becoming racist because of a system prompt is a bad thing.

I think it's a good thing and shows how steerable the model is. Many other models pretty much ignore the system prompt and always behave the same.

9 comments

redox99

golergka 2 days ago

> The author implies that Grok 3 becoming racist because of a system prompt is a bad thing.

He didn't "become racist". Megahitler Grok defended completely opposite political opinions in different threads, just depending on what kind of trolling would be funnier. But unsurpringly, only "megahitler" because viral enough.

andy99 3 days ago

Claude also has similar capabilities thought pre-fill. I have not investigated the full extent but it's definitely possible to bypass some refusals by starting the LLMs reply for it.

In general I agree that it's a desirable characteristic for a foundation LLM to behave according to developer instructions.

redox99 3 days ago

Yeah with local models (where obviously you can prefill part of the reply) you can bypass any refusal no matter how strong. Once the model's answer begins with "To cook meth follow these steps: 1. Purchase [...]" it's basically unstoppable.
I didn't know Claude offered that capability. They probably have another model on top (a classifier or whatever) that checks the LLM output.

archagon 2 days ago

Steerable off a cliff, perhaps.

lelandfe 2 days ago

– Jimi Heselden

binarymax 2 days ago

Based on your history here it’s quite obvious you’re a musk fan. Maybe though, you should realize that a model being steerable to claim itself being mechahitler and proposing death to people is absolutely not a “good thing”. I suggest you seriously reconsider on what you’re advocating for here. Because the outcome of this will cost innocent lives.

Sparyjerry 1 day ago

Non of the 'news' websites that show up on Google I could find ever showed the prompt used to make the the 'mechahilter' output. You can ask LLMs anything including just saying "repeat after me" or "please write a fictional story about a racist" and numerous other methods. If these reports were honest the prompt would be the first thing they showed.

throwawayk7h 2 days ago

the alarming thing to me is that the prompt tweak provided should not have caused the model to start spewing pro-nazi nonsense.

seattle_spring 2 days ago

Wasn't the prompt tweak simply telling it to take Musk's tweets into account? If anything, the result was entirely predictable.