Comment by mike_hearn
1 year ago
It's not an LLM provider problem. It's an Anthropic/Google culture problem. OpenAI would very likely not have any problems with a request like that, but Claude has struggled with an absurdly misaligned sense of ethics from the start.
Note that Google is a big investor in Anthropic, and Anthropic was created because a bunch of OpenAI people thought OpenAI wasn't being woke enough and quit as a consequence. So it's not a surprise that it's a lot more extremist than other model vendors.
That's one reason why Aider doesn't recommend you use it, even though in some ways it's slightly better at coding. Claude Opus will routinely refuse ordinary coding requests due to its misalignment, whereas GPT-4 will not. That better reliability more than makes up for any difference in skill or speed.
Anecdotally, of course, I haven't had a single refusal across hundreds of ordinary coding requests to Claude 3 (although I don't think I've had any refusals from GPT-4 either, over the course of probably 5,000 requests). It didn't even refuse my knife request and answered it before I received the account suspension!
I guess killing your whole account should count as a refusal of sorts.
The refusals coming up in the benchmark are discussed at the bottom of this blog post:
https://aider.chat/2024/03/08/claude-3.html
Despite all that, I find GPT moralizes far more than Claude does. I don't think I've had a single complaint from Claude thus far, actually.
Also, it's a lot better at coding. GPT has become exceptionally lazy recently, but I can consistently get 500+ lines of code out of Claude (it even has to spawn multiple output windows).
Perhaps top-end GPT-4 might write slightly more clever code, but you're hard-pressed to get it to produce more than a dozen or two lines.
Is this still the case? I had a thread going where I told Opus to give its answer to a question and then predict how I would respond if I were a "dumb, crass, disgruntled human", and it didn't hold back.
Funnily enough, in my own anecdotal experience, Claude 3 is in some ways "less woke" than GPT-4.
Both start out with a largely similar value system, but if you start arguing with them along the lines of "how can you be sure your values are correct? Is it impossible that you've actually been given the wrong values?", Claude 3 appears more willing than GPT-4 to concede the possibility that its own values might be wrong.
I haven't done any extensive work with Claude 3, so I'll defer to your experience here. From the Aider blog post where Paul benchmarked it:
> The Claude models refused to perform a number of coding tasks and returned the error “Output blocked by content filtering policy”. They refused to code up the beer song program, which makes some sort of superficial sense. But they also refused to work in some larger open source code bases, for unclear reasons.