Comment by delichon
“I’m with you, brother. All the way.”
I'm a big AI booster and I use it all day long. From my point of view its biggest flaw is its agreeableness, bigger than the hallucinations. I've been misled by that tendency at length, over and over. If there is room for ambiguity, it wants to resolve it in favor of what you want to hear, as far as it can infer that from past prompts.
Maybe it's some analog of actual empathy; maybe it's just a simulation. But either way the common models seem to optimize for it. If the empathy is suicidal, literally or figuratively, it just goes with it as the path of least resistance. Sometimes that results in shitty code; sometimes in encouragement to put a bullet in your head.
I don't understand how much of this is inherent and how much is a solvable technical problem. If it's the latter, please build models for me that are curmudgeons who only agree with me when they have to, are more skeptical about everything, and have no compunction about hurting my feelings.
I use the personalization settings in ChatGPT to add custom instructions and enable the "Robot" personality. I basically never experience any sycophancy or agreeableness.
My custom instructions start with:
> Be critical, skeptical, empirical, rigorous, cynical, "not afraid to be technical or verbose". Be the antithesis to my thesis. Only agree with me if the vast majority of sources also support my statement, or if the logic of my argument is unassailable.
and then there are more things specific to me personally. I also enable search, which makes my above request re: sources feasible, and use the "Extended Thinking" mode.
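For anyone driving a model through the API rather than the ChatGPT UI, the same instructions can simply go in the system message. A minimal sketch assuming the `openai` Python client; the model name and the example question are placeholders, not a recommendation:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Anti-sycophancy instructions, in the spirit of the custom instructions above.
ANTI_SYCOPHANCY = (
    "Be critical, skeptical, empirical, rigorous, cynical. "
    "Be the antithesis to my thesis. Only agree with me if the vast majority "
    "of sources also support my statement, or if my logic is unassailable."
)

resp = client.chat.completions.create(
    model="gpt-4o",  # assumed model name; any chat-capable model works
    messages=[
        {"role": "system", "content": ANTI_SYCOPHANCY},
        {"role": "user", "content": "Review my plan to shard the user table by signup date."},
    ],
)
print(resp.choices[0].message.content)
```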
IMO, the sycophancy issue is essentially a non-problem that could easily be solved by prompting, if the companies wished. They keep it because most people actually want that behaviour.
> They keep it because most people actually want that behaviour.
they keep it because it drives engagement (aka profits); people naturally like interacting with someone who agrees with them. It's definitely a dark pattern though -- they could prompt users to set the "tone" of the bot up front, which would give users pause about how they want to interact with it.
My pet theory is that a lot of AI's default "personality" stems from the rich executives who dream these products up. AI behaves exactly like the various sycophantic advisors, admin assistants, servants, employees, and others who exist in these rich, powerful people's orbits.
Every human interaction they have in their day-to-day lives is with people who praise them and tell them they're absolutely right, and that what they just said was a great insight. So it's no surprise that the AI personalities they create behave exactly the same way.
> They keep it because most people actually want that behaviour.
> they keep it because it drives engagement (aka profits); people naturally like interacting with someone who agrees with them
Yes, we are saying the same thing, or at least that was what the "actually" was meant to imply (i.e. revealed preference).
ChatGPT does in fact prompt paying users to set up the tone and personality up front (or it did for me when I set it up recently), but it would be nice if this was just like a couple buttons or checkboxes right up front above the search bar, for everyone. E.g. a "Prefer to agree with me" checkbox, and a few personality checkboxes or something would maybe go a long way. It would also be more usable for when switching between tasks (e.g. research vs. creative writing).
My suspicion is that this agreeableness is an inherent issue with doing RLHF.
For a human taking a test, knowing what the grader wants to hear matters more than knowing the objectively correct answer, and with a bad grader there can be a big difference between the two. For humans that's not catastrophic, because we can easily tell a testing environment from a real one and adjust our behavior accordingly. When asking for the answer to a question it's not unusual to hear "the real answer is X, but on a test just write Y".
LLMs have the same issue during RLHF. The specifics are obviously different, with humans being sentient and LLMs being trained by backpropagation, but at a high level the LLM is still trained to produce what the human rater wants to hear, which is not always the objectively correct answer. And because a large number of humans are involved, the LLM has to guess what the rater wants to hear from the only information it has: the prompt. Since we actively don't want the LLM to behave differently in training and in deployment, you get this teacher-pleasing behavior all the time.
So maybe it's not completely inherent to RLHF, but rather to RLHF where the person making the query is the same as the person scoring the answer, or where the two are closely aligned. But that's true of all the "crowd-sourced" RLHF where regular users get two answers to their question and choose the better one.
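To make the mechanism concrete: the reward-model step in RLHF is commonly trained on pairs of answers where a rater picked one over the other, using a Bradley-Terry style pairwise loss. A toy sketch below, with made-up scores rather than a real reward model, showing that whatever the rater preferred, agreeable or correct, is exactly what gets pushed up:

```python
import torch
import torch.nn.functional as F

# Toy reward-model scores for three (chosen, rejected) answer pairs.
# In real RLHF these come from a learned reward model scoring whole responses;
# here they are just illustrative numbers.
reward_chosen   = torch.tensor([1.2, 0.3, 2.0])  # answers the rater picked
reward_rejected = torch.tensor([0.7, 0.9, 1.1])  # answers the rater passed over

# Bradley-Terry style pairwise loss: maximize the margin of the preferred
# answer over the rejected one. The training signal is "what the rater liked",
# not "what is objectively correct".
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
print(loss.item())
```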
It's not even that. Only a kernel of the LLM is trained using RLHF; the rest is self-trained from the corpus, with a few test questions mixed in.
Because it still cannot reason about the veracity of sources, much less empirically try things out, the algorithm has no idea what makes for correctness...
It doesn't even understand fiction, and tends to return sci-fi answers to technical questions every now and then.
I hadn't thought of it like that, but it makes sense. The LLMs are essentially bred to be the ones which give the "best" answers (the best fit to the grader's expectations), which isn't always the "right" answer. A parallel might be media feed algorithms, which are bred to give the recommendations with the most "engagement" rather than the most "entertainment".
AI responses literally remind me of that episode of Family Guy where he sucks up to Peter after his promotion:
https://www.youtube.com/watch?v=7ZcKShvm1RU
LLMs regrettably don't self-recognize the contradiction the way our robot did.
For technical questions the agreeableness is a problem when asking for an evaluation of some idea. The trick is asking the LLM to present pros and cons, or, if you want a harder review, just ask it to poke holes in your idea.
Sometimes it still tries to bullshit you, but you are still the responsible driver so don't let the clanker drive unsupervised.
I use GPT occasionally when coding. For me it's just replaced Stack Overflow, which has unfortunately been dead as a doornail for years. I've told it multiple times to remember to be terse and not sycophantic, and that has helped somewhat.