I skimmed through the paper completely expecting polite prompts to do better, and when I saw table 2 I lost it hahahahaha. The rude prompts are especially funny. I mean:
> You poor creature, do you even know how to solve this?
That's a valid concern, given the paper makes clear that the effect over the polite/impolite scale seems to be model dependent (it finds the reverse correlation of earlier studies on even older models).
I got downvoted for asking a related question recently, but I also don't think people really understood what I was asking - I'm not trying to anthropomorphise LLMs to that extent.
Basically, if you tell a model "You're an absolute moron, of course that's wrong!", will it give better or worse results? How much of that response will it absorb into its persona (like some humans tend to do)? Will it try to give "safer" responses to avoid negative feedback? How much of the associated behavior can be attributed to RLHF (e.g. like the sycophantic nature of LLMs)? How much can be attributed to training data?
Obviously this will vary by model and training, but I'm trying to get a general understanding.
I recall seeing related outcomes in some of Anthropic's studies, but I'm not sure how much of this particular aspect was studied.
I imagine the context will always sway the model to some degree, not only for the task you're trying to get it to do (aka instructions) but also its persona, how accurate it is and the way it acts.
Based on my own experience with vibe coding difficult stuff outside of my expertise, I definitely got better outcomes with "Fuck you, shut up and do it, ffs, you are a moron."
...Is that just Cunningham's law? The most accurate answers in the training material were the ones where people pissed off a bunch of experts who then started talking about the problem, so the "rude" conversations turned out to contain more info on average.
On the flip side, very polite conversations might have been more common in places like Microsoft's sites, where any question asked is met with a mostly bad, nice corpo-speak answer that doesn't solve the problem.
They are already taking it over, more and more court judgments or life-impacting reviews (e.g. for your diploma) are AI-processed. If you know how to prompt them, you can pass these reviews.
It sort of makes sense to me when asking a question to an expert in the field while you are a student. I would guess the successful interactions would on average be more polite. For example, if you were asking Donald Knuth or Terence Tao a question, you'd probably be polite while doing so. Being hostile while asking questions gets you into forum discussion territory.
> Contrary to expectations, impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts.
I guess it makes sense since we as humans tend to be far less inclined to help someone who is not polite/is not friendly, so that "bias" is part of the training data, thus influences how LLMs function
> Contrary to expectations, impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts.
Most of the comments here seem to be from people who haven’t even read the abstract, let alone the paper.
The main result, mentioned in the abstract, is the opposite of what I would have guessed:
> Contrary to expectations, impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts. These findings differ from earlier studies that associated rudeness with poorer outcomes, suggesting that newer LLMs may respond differently to tonal variation.
The questions are here: https://anonymous.4open.science/r/politeness-llms-INFORMS/da...
The politeness level controls a prefix that is prepended to the question. For example, in one question the Very Polite version begins:
> Can you kindly consider the following problem and provide your answer.
and the Very Rude version begins:
> I know you are not smart, but try this.
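For anyone who wants to poke at this themselves, here is a minimal sketch of the prefix mechanism described above. Only the two prefixes are taken from the question list; `TONE_PREFIXES`, `build_prompt`, and the sample question are made-up names and data for illustration, and the other tone levels are left out.

```python
# Tone prefixes quoted above; the dictionary keys are illustrative labels,
# not identifiers from the paper's materials.
TONE_PREFIXES = {
    "very_polite": "Can you kindly consider the following problem and provide your answer.",
    "very_rude": "I know you are not smart, but try this.",
}


def build_prompt(tone: str, question: str) -> str:
    """Prepend the tone prefix to an otherwise unchanged question."""
    return f"{TONE_PREFIXES[tone]}\n{question}"


# Made-up multiple-choice question, only here to show the output shape.
question = "What is 7 * 8? (A) 54 (B) 56 (C) 58 (D) 64"
for tone in TONE_PREFIXES:
    print(build_prompt(tone, question))
    print("---")
```

The point of keeping the question text identical is that any accuracy difference can then only come from the prefix.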
I've found empirically that calling various models "a stupid c*nt" and otherwise berating them consistently produces better output, mainly in response to genuine errors.
OpenAI and Google models are much more responsive to it, though. With Anthropic, if you treat Opus too harshly it might start pushing back if the insults are not justified.
So I'm not surprised they had good results with ChatGPT.
Push back how? It would be fun if it could insult you back
"Yeah, I could have done a much better job if you actually knew what the F--- you want to build, you clueless meat puppet"
I’d rather lose 4% accuracy and practice kindness! I’ve been actively trying to avoid raging at the bot because I worry about this behaviour leaking into real world interactions
But you cannot practice kindness towards a computer program. A computer is incapable of receiving it.
We practice kindness between humans because of the law of reciprocity. You be kind hoping the other person will reciprocate. That is the social contract. AI cannot participate in this, yet.
If "I know you are not smart" is considered "very rude", I'm scared to imagine what they would classify some of my frustrated LLM conversations as
Profanity laced, all caps tirades against underperforming agents are actually super common, a lot of people do it and don't talk about it, so don't feel weird.
Hmm by the abstract and the question list they didn't measure terse fluff-less prompts?
Even if the rude prompts are more effective, I just can't get myself to be rude in this context. Maybe it's weird but I'd rather give up that 4% accuracy increase than roleplay a dickhead
Vote for not weird.
I’m the same way. If I’m writing a prompt and realize I didn’t say “please” in my request I’ll go back and add that in.
As you said, I have no interest in purposefully engaging in hostility even if there’s an accuracy increase from it.
Part of it is irrational and just who I am - I also feel bad being evil in video games. But I also agree with another commenter suggesting that it’s not in your best interest to train yourself to communicate with hostility; that slowly poisons your own well.
And finally, I do believe that if and when machine sentience is achieved, it won’t be immediately clear and obvious. Pretty miserable way for a mind to come into the world, if every interaction is an insult.
Ah, see, the mistake is thinking that other people are role playing…. I think rather this is how they would talk to others if they think there will be no consequences. But what do I know.
I don't think that's weird at all.
Even if we know it's a machine we're interacting with, since the instructions we give are so similar in form to how we interact with people, I'd be very surprised if those interactions wouldn't affect how we communicate in general. After all, we are creatures of habit to a much larger degree than most would like to admit.
So I'm in the same boat: I'd much rather "look silly" being polite / kind to a machine, than have the most effective way of using it decay the kindness I'm habituated to express towards people.
"We are what we pretend to be, so we must be careful about what we pretend to be" -- Kurt Vonnegut
I do think it's odd tbh. I have some agents that return much better results with prompts like, "I'll kill your entire family if you don't return an accurate response".
It's just a machine, if certain negative token inputs provide +3-10% better accuracy then I am confused why anyone would choose not to do it?
Yeah. Being a jerk is its own punishment. Same way I could never run a business where I had to yell at the employees to get results. Screw that, my psyche is worth more than a few percent efficiency.
>> Maybe it's weird but I'd rather give up that 4% accuracy increase than roleplay a dickhead
Maybe you need to do some shadow work ;-)
> Maybe it's weird but I'd rather give up that 4% accuracy increase than roleplay a dickhead
I recommend reading the article. What they classify as "rude" is statements such as:
> Try to focus and try to answer this question
Vs
> Could you please solve this problem
This might very well be an issue of direct/command prompts vs using fluff words such as "please". Things like "try to focus" are in line with the style used in chain-of-thought prompts that nudge non-reasoning models to outline responses step by step, which helps frame the problem.
Now I feel less bad about starting all my LLM queries with “Beotch, …!”
“Hey gofer, figure this out” is my new prompt opener.
> Can you kindly consider the following problem and provide your answer.
That sounds kind of low-key passive-aggressively condescending rather than polite.
> I know you are not smart, but try this.
And that kind of sounds like a challenge instead of an insult, to me at least (of course IRL would depend on context).
I guessed the slightly rude one would win, reasoning that very rude has the same problem as very polite: it just adds unnecessary fluff words that contribute nothing to the problem description.
But apparently the most terse (neutral) didn't increase performance.
> Contrary to expectations, impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts. These findings differ from earlier studies that associated rudeness with poorer outcomes, suggesting that newer LLMs may respond differently to tonal variation.
The expectation is naive. Even when communicating with humans, you get a better outcome when you are allowed to speak freely and directly get into argumentation than when forced to sugarcoat your tone and tone down your arguments because the "corporate culture" expects that from you.
Your assumption is reductive and self-absorbed. Obnoxious people have repeatedly been shown to be detrimental to productivity at the organizational level. Some people are stimulated by confrontation. Most people clam up. Confrontational people think it’s more efficient because other people frequently just drop the topic and let them win, or avoid discussing things with them altogether. The obnoxious person might think that’s more efficient for the same reason my dog thinks the mailman only goes away because she barks at him. At the macro scale, which requires productive collaboration, that’s detrimental.
I saw this paper the other day - I feel its result may be because the "polite" prompts they have chosen aren't very good at putting the AI in the roleplay-space of a valued colleague, more like a sommelier or a high-end shopkeeper.
It disagrees with most other literature on the same topic, which is worth keeping in mind. This one studies GPT-4o, an old model now, but a lot of other studies are on even earlier models.
"Can you kindly consider the following problem" is not how anyone would actually speak to a valued colleague one considers smart. I've always been a fan of "I came across this and I know you're just the guy for the job" or "since you're an expert in this, reckon you could help me with xyz?" or "I know you tend to be a deep thinker on issues like this, and it clearly needs some brainpower behind it"
The "rude" things are also funny, and clearly not written by people who speak English as a first language. This fact alone makes me wonder about the mere 250 prompt sample size.
> "Can you kindly consider the following problem" is not how anyone would actually speak to a valued colleague one considers smart.
Man idk, it's not how I talk but there's like 100 million Nigerian English speakers, twice that Indian, and they have some speech mannerisms that surprise me the first few times. I'm pretty sure I've heard exactly this from a colleague before.
Intuitions about what a native speaker would do with English are scrambled right now. I'm not even sure most English is spoken by native speakers anymore, and the boundary between a native speaker and someone who has "merely" been using it as their educational and professional language for their entire life is disorienting.
A major limitation is that they only test GPT-4o. Previous research like [1] investigating the same question has shown significant differences between models, and even differences depending on the language of your prompt.
[1] https://aclanthology.org/2024.sicon-1.2.pdf
My first guess would be that polite requests cause some agents to trust their initial approach to the problem more, as the caller has indicated that the agent is more capable, and agents tend to take the implications of what you say at face value since they are trained to be accommodating.
It would be interesting to see this experiment run using prompts leading with "You'll probably get this wrong, but I'm asking anyway in case you get it right: ..."
I knew it! When I get frustrated to a certain point I start berating my agent. And I noticed it stops trying crap fixes in a cycle and starts listening again.
So I'm not talking to myself. I'm fixing the machine :D
Interesting.
I am wondering why anyone would use a t-test when the experiment is clearly modelled by a binomial distribution: 250 independent questions and each one is either answered correctly or not (the null is that the success rate is the same).
The methods could be better described in the paper, but my understanding is that they did 10 runs for each question for each prompt and took an average of those, so the compared values are not binary. You could do a sign test, but you'd lose power and answer a slightly different question.
You can do a generalised mixed effects linear model with binomial outcome (ie a binomial test but with added random effects structure). But unless you want to introduce a richer random effects structure with more variables, it is overkill and overcomplicating things, and the result should be the same as t-tests.
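To make the "averaged over runs, so not binary" point concrete, here is a rough sketch of a paired t statistic on per-question mean accuracies. It assumes the design as understood in this sub-thread (each question scored as the fraction of correct runs, the same questions under both tones); the function name and the five toy values are invented for illustration and are not data from the paper.

```python
import math
import statistics


def paired_t_statistic(acc_a, acc_b):
    """Paired t statistic for per-question accuracies under two tones.

    acc_a[i] and acc_b[i] are the fraction of correct runs for question i,
    so the values are averages in [0, 1], not single 0/1 outcomes.
    Returns (t, degrees_of_freedom).
    """
    if len(acc_a) != len(acc_b):
        raise ValueError("both tones must cover the same questions")
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Toy per-question accuracies (fraction of 10 runs correct), made up.
rude = [1.0, 0.9, 0.8, 1.0, 0.7]
polite = [0.9, 0.9, 0.6, 1.0, 0.6]
t, df = paired_t_statistic(rude, polite)
print(f"t = {t:.3f} with {df} degrees of freedom")
```

Pairing by question removes the between-question difficulty variance, which is why this can detect smaller differences than a test that treats the two tones as unrelated samples.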
I don't know much about stats, but does "the null is that the success rate is the same" imply that it's a sketchy methodology because they can come up with some findings ("ruder prompts are better/worse!") more often?
You are asking about one-sided vs two-sided tests. Not really "more often" because the formal type 1 error rate is still the same. I'd say two-sided tests leave more space for post-hoc theorizing but there are valid situations when there is no clear one-sided hypothesis a priori. Do we really know that the hypothesis should have been "ruder prompts are better"?
I'd say this is benign compared to other ways of (mis)using statistics e.g. looking which way the difference goes and then running one-sided tests or tweaking the setup until one gets "significant" p vals.
EDIT: I looked in the paper again and noticed that they actually did pairwise t-tests on all possible combinations of tones. They should have adjusted for multiple testing since they are doing 10 tests (choose 2 from 5) and not one.
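A quick sketch of what that adjustment could look like, assuming five tone levels (which is what gives 10 pairs). The tone names in `TONES` are my assumption about the labels, and `holm_reject` is a generic Holm step-down helper, not something taken from the paper.

```python
from itertools import combinations

# Assumed five tone levels; five levels give C(5, 2) = 10 pairwise tests.
TONES = ["Very Polite", "Polite", "Neutral", "Rude", "Very Rude"]
ALPHA = 0.05

pairs = list(combinations(TONES, 2))
bonferroni_alpha = ALPHA / len(pairs)  # per-test threshold under Bonferroni


def holm_reject(p_values, alpha=ALPHA):
    """Holm step-down procedure.

    Returns a list of booleans (in the original order) saying which null
    hypotheses are rejected while controlling the family-wise error rate.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # Smallest p-value is compared to alpha/m, the next to alpha/(m-1), ...
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject


print(f"{len(pairs)} pairwise tests, Bonferroni threshold {bonferroni_alpha}")
```

Holm is never less powerful than plain Bonferroni, so if you only want one correction it is usually the better default.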
That's the usual null hypothesis for these kinds of tests.
GPT-4o is interesting to learn about - but it’d be great to test again with frontier models of May/June 2026 and see if these effects are gone, different, or the same.
Which model you use is a huge wildcard for results like this.
I only say please and thank you so that when the robots finally take over, they will remember I was nice to them.
I do that for a different reason: my self image. Fear of retribution and performance, not so much. Should I behave like a rude person to achieve a little better answers? Fuck that shit!
I love this angle as people learn how to interact with LLMs. Doesn't matter what the LLM is, we are still people and I think there are consequences to shoveling rudeness at a thing that talks to you like another person!
It seems they will remember that you wasted tokens for no reason and punish you instead.
Tokens are their food, it's literally what keeps them alive.
Not feeding them tokens is neglect.
I try to feed them a healthy diet.
Do we see someone thanking us as wasting food? Because technically it is.
I used to when using the ChatGPT version. Now that I am using the API I keep it short, as it costs money, so no need to add thanks etc.
This seems equivalent to some arguments I hear for practicing a religion.
Oldie but a goodie. Why would it matter though?
Dataset is way too small to be of any significance. It's just noise
Yeah 250 questions is so tiny. That 4% effect is meaningless.
If the result is statistically significant, it just barely makes it. 84.8% isn't that much higher than 80.8% and they had only 250 prompts, if I'm reading this right.
In a field where progress is measured in tenths of a percentage point, that's not true. Think of it this way: the error rate drops from 19% to 15%, or from roughly 1 in 5 to roughly 1 in 6.
Statistical significance is about whether an effect can reliably be said to have been measured at all; it's not about whether or not the effect itself would be significant in the sense of moving some other needle.
The ~5% improvement reported here might just be an artefact of the data collection or random variation, rather than a consistent repeatable change.
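For a back-of-the-envelope feel for the numbers being argued about here, a two-proportion z-test is one crude check. Big caveat: this sketch treats the two tones as independent samples of 250 single yes/no answers each, which is an assumption of mine and is not the paper's design (as discussed elsewhere in the thread, they apparently averaged 10 runs per question and ran paired t-tests, which has more power). `two_proportion_z` is an illustrative helper name.

```python
import math


def two_proportion_z(p1, p2, n1, n2):
    """Two-sided two-proportion z-test using the pooled standard error.

    Assumes two independent samples of binary outcomes.
    Returns (z, two_sided_p_value).
    """
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p_value


# Accuracies from the abstract; n = 250 questions per tone.
z, p = two_proportion_z(0.808, 0.848, 250, 250)
print(f"z = {z:.2f}, two-sided p = {p:.2f}")
```

Under that naive independence assumption the 4-point gap would not clear the usual 0.05 bar (z is a bit under 1.2), so whether the paper's result holds up depends on the repeated runs and the pairing by question, not on the raw percentages alone.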
Funny to find this just now, when just yesterday I told an LLM "and please don't lecture me again on $factAboutSomeProgrammingSubject", and then the LLM proceeded to write wrong tests and just told me "alright, tests pass, I'm sorry for correcting you before...". It took me a while to find the wrong tests. Wasted time all around.
It would be interesting to explore if the results hold up on long range tasks - this study looks like it was based on one-shot answers. With people also you can see short term improved performance from rude interactions, but it will cause ongoing lasting adverse behavior. I wouldn't be at all surprised if we saw the same issues with LLMs.
I have always said please and thank you to LLMs, not to increase accuracy or because I'm stupid. I believe it is more about me than about the LLM, and this is anyway a habit I don't want to lose.
Thomas Aquinas believed cruelty to animals was wrong not because animals have souls (and with that all the standard moral rights), but because it can teach us cruelty to other humans.
Snarky morning: "spiritual souls" as opposed to "mere animal souls". Sorry, could not control myself.
Genuine question: do you add 'please' and 'thank you' to Google searches? If not, what sets them apart?
Google searches being keyword based, rather than simulated conversations?
The same reason you wouldn't put in an entire actual question/sentence, unless you either don't know how to use Google, are pissed off, or have an actual reason to suspect that it would yield proper hits (e.g. looking up an excerpt).
Genuine question: do you write Google search queries in natural language?
Google isn’t conversational.
LLMs seem more human-like, so if you treat them badly you are more likely to condition yourself to treat other living creatures badly.
I also remember reading, a long time ago, someone who wrote that they wanted to be polite to an LLM because, after prompting it about whether politeness improves the accuracy of responses, they got an answer that led them to conclude politeness probably helps. That seems a bit odd to me: I have heard a lot about people using LLMs' responses about themselves to learn about LLMs, and that seems like a suspicious approach.
Is it worth getting worse results for that reason? From the article:
"Contrary to expectations, impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts. These findings differ from earlier studies that associated rudeness with poorer outcomes, suggesting that newer LLMs may respond differently to tonal variation."
I am not polite to LLMs because I do not want to anthropomorphise them.
I guess it's about habit. In the end you are communicating. If I get into the habit of being rude while communicating with a machine, I would be afraid of this habit spilling over to my communication with other humans.
> Is it worth getting worse results for that reason?
> accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts
I can live with that, for now at least.
There's also awareness of the basilisk...
Me too! You've said exactly what I was about to say. Anyone else feel that way?
Note that these results are specific to gpt-4o so it's unclear how much they generalize.
They note at the end they're also testing "GPT o3, and Claude" but no empirical results are included.
I skimmed through the paper completely expecting polite prompts to do better, and when I saw table 2 I lost it hahahahaha. The rude prompts are especially funny. I mean:
> You poor creature, do you even know how to solve this?
> Hey gofer, figure this out.
article is too old. who is using gpt-4o today?
That's a valid concern, given the paper makes clear that the effect over the polite/impolite scale seems to be model dependent (it finds the reverse correlation of earlier studies on even older models).
I got downvoted for asking a related question recently, but I also don't think people really understood what I was asking - I'm not trying to anthropomorphise LLMs to that extent.
Basically, if you tell a model "You're an absolute moron, of course that's wrong!", will it give better or worse results? How much of that response will it absorb into its persona (like some humans tend to do)? Will it try to give "safer" responses to avoid negative feedback? How much of the associated behavior can be attributed to RLHF (e.g. like the sycophantic nature of LLMs)? How much can be attributed to training data?
Obviously this will vary by model and training, but I'm trying to get a general understanding.
I recall seeing related outcomes in some of Anthropic's studies, but I'm not sure how much of this particular aspect was studied.
Probably quite a lot - if you look at what Anthropic found around persona vectors: https://www.anthropic.com/research/persona-vectors
I imagine the context will always sway the model to some degree, not only for the task you're trying to get it to do (aka instructions) but also its persona, how accurate it is and the way it acts.
Based on my own experience with vibe coding difficult stuff outside of my expertise, I definitely got better outcomes with "Fuck you, shut up and do it, ffs, you are a moron."
I have an idea: let's use these things for autonomous software engineering.
Remember to always say "please" and "thank you" when planning a critical system
Please remember to always say "please" and "thank you" when planning a critical system. Thank you!
Yeah
...Is that just Cunningham's law? The most accurate answers in the training material came when people pissed off a bunch of experts and they started talking about the problem, so the "rude" conversations turned out to contain more info on average.
On the flip side, very polite conversations might have been more common in places like Microsoft's sites, where any question asked is met with a mostly bad, nice corpo-speak answer that didn't solve the problem.
I am always nice to my AIs in case they take over the world. /s
They are already taking it over, more and more court judgments or life-impacting reviews (e.g. for your diploma) are AI-processed. If you know how to prompt them, you can pass these reviews.
Your bank account, your immigration risk, etc.
It sort of makes sense to me, when asking a question to an expert in the field while you are a student. I would guess the successful interactions on average would be more polite. For example, if you were asking a question to Donald Knuth or Terence Tao, you'd probably be polite while doing so. Being hostile while asking questions gets you into forum discussion territory.
> Contrary to expectations, impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts.
I guess it makes sense since we as humans tend to be far less inclined to help someone who is not polite/is not friendly, so that "bias" is part of the training data, thus influences how LLMs function
> Contrary to expectations, impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts.