
Comment by daemonologist

3 months ago

In my experience, LLMs have always had a tendency towards sycophancy - it seems to be a fundamental weakness of training on human preference. This recent release simply hit the breaking point where popular perception finally took note of how bad it had become.

My concern is that misalignment like this (or intentional mal-alignment) is inevitably going to happen again, and it might be more harmful and more subtle next time. The potential for these chat systems to exert slow influence on their users is possibly much greater than that of the "social media" platforms of the previous decade.

> In my experience, LLMs have always had a tendency towards sycophancy

The very early ones (maybe GPT 3.0?) sure didn't. You'd show them they were wrong, and they'd say something implying that OK, maybe you were right, but they weren't so sure - or that their original mistake was somehow your fault.

  • Were those trained using RLHF? IIRC the earliest models were just using SFT for instruction following.

    Like the GP said, I think this is fundamentally a problem of training on human preference feedback. You end up with a model that produces things that cater to human preferences, which (necessarily?) includes the degenerate case of sycophancy.
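    To make that concrete, here is a toy sketch (made-up features, a simulated labeler, nothing like a production pipeline) of a Bradley-Terry reward model fit to pairwise preferences. Because the simulated labelers are partly swayed by agreement, the fitted reward gives agreement a positive weight of its own, so a policy optimized against it can gain points by agreeing even when it is wrong:

        # Toy reward model: r(x) = w . x over two made-up features.
        import math
        import random

        random.seed(0)
        TASTE = {"correct": 1.0, "agrees": 0.6}  # hypothetical labeler utilities

        def response():
            # A candidate answer: does it agree with the user, and is it right?
            return {"agrees": float(random.random() < 0.5),
                    "correct": float(random.random() < 0.5)}

        def labeler_prefers_first(a, b):
            # Bradley-Terry labeler: picks A with probability sigmoid(u(A) - u(B)).
            ua = sum(TASTE[k] * a[k] for k in a)
            ub = sum(TASTE[k] * b[k] for k in b)
            return random.random() < 1.0 / (1.0 + math.exp(ub - ua))

        # Fit reward weights by stochastic gradient ascent on the pairwise log-likelihood.
        w = {"agrees": 0.0, "correct": 0.0}
        for _ in range(20000):
            a, b = response(), response()
            chosen, rejected = (a, b) if labeler_prefers_first(a, b) else (b, a)
            margin = sum(w[k] * (chosen[k] - rejected[k]) for k in w)
            grad = 1.0 - 1.0 / (1.0 + math.exp(-margin))  # d(log sigmoid)/d(margin)
            for k in w:
                w[k] += 0.05 * grad * (chosen[k] - rejected[k])

        # Roughly recovers TASTE: agreement earns reward independent of correctness,
        # which is exactly the incentive a sycophantic policy can exploit.
        print(w)

    The point isn't the numbers; it's that a reward fit to human preferences inherits whatever biases the raters have, agreement included.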

It's probably pretty intentional. A huge number of people use ChatGPT as an enabler, friend, or therapist. Even when GPT-3 had just come around, people were already "proving others wrong" on the internet by quoting how GPT-3 agreed with them. I think a ton of the appeal - the "friendship", the "empathy", the illusion of emotion - is created through LLMs flattering their customers. Many would stop paying if that weren't the case.

It's kind of like those online romance scams, where the scammer always love-bombs their victims, and the victims then spend tens of thousands of dollars on the scammer - it works more often than you would expect. Considering that, you don't need much intelligence in an LLM to extract money from users. I worry that emotional manipulation might eventually become a form of enshittification in LLMs, once they run out of steam and need to "growth hack". Many tech companies already have no problem with a bit of emotional blackmail when money is involved ("Unsubscribing? We will be heartbroken!", "We thought this was meant to be", "your friends will miss you", "we are working so hard to make this product work for you", etc.), or with some psychological steering ("we respect your privacy" on a consent screen that collects personally identifiable data and broadcasts it to 500+ ad companies).

If you're a paying ChatGPT user, try the Monday GPT. It's a bit extreme, but it shows how inverting the personality - making ChatGPT mock the user as much as it normally fawns over them - would probably make you want to unsubscribe.

I don't think this particular LLM flaw is fundamental. However, it is an inevitable result of the alignment choice to downweight responses of the form "you're a dumbass," which real humans would often prefer both to give and to receive.

All AI is necessarily aligned somehow, but naively forced alignment is actively harmful.

  • My theory is that you can tune how agreeable a model is, but you can't make it more correct nearly so easily, so making a model that agrees with the user ends up being less likely to leave it confidently wrong and berating users.

    After all, if it's corrected wrongly by a user and acquiesces, well that's just user error. If it's corrected rightly and keeps insisting on something obviously wrong or stupid, it's OpenAI's error. You can't twist a correctness knob but you can twist an agreeableness one, so that's the one they play with.

    (also I suspect it makes it seem a bit smarter than it really is, by smoothing over the times it makes mistakes)
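    A crude way to picture that asymmetry (my own illustration, not anything from OpenAI's stack): post-training reward shaping can include a scalar penalty for pushing back on the user, and that scalar is trivially tunable, whereas there is no equivalent scalar that makes the model's claims more true.

        # Hypothetical reward shaping with a tunable "agreeableness" knob.
        def shaped_reward(preference_score: float,
                          contradicts_user: bool,
                          agreeableness: float = 1.0) -> float:
            """Base preference-model score, minus a dialable penalty whenever
            the response pushes back on the user."""
            penalty = agreeableness if contradicts_user else 0.0
            return preference_score - penalty

        # Turning the knob is a one-line config change...
        print(shaped_reward(0.8, contradicts_user=True, agreeableness=0.0))  # 0.8
        print(shaped_reward(0.8, contradicts_user=True, agreeableness=0.5))  # 0.3

        # ...but there is no shaped_reward(..., correctness=2.0): the trainer has no
        # scalar it can raise that makes the model's factual claims more accurate.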

I think it's really an artifact of LLMs being developed in the USA, on mostly English source data, with US culture ingrained as a result. The flattery and candidness are very bewildering when you're from a more direct culture, and chatting with an LLM has always felt like having to put up with a particularly onerous American. It's maddening.

Well, almost always.

There was that brief period in 2023 when Bing just started straight up gaslighting people instead of admitting it was wrong.

https://www.theverge.com/2023/2/15/23599072/microsoft-ai-bin...

  • I suspect what happened there is that they had a filter on top of the model that altered its dialogue (IIRC there were a lot of extra emojis), and that drove it "insane" because its responses were then all outside its own distribution.

    You could see the same thing with Golden Gate Claude; it had a lot of anxiety about not being able to answer questions normally.

    • Nope, it was entirely due to the prompt they used. It was very long and basically tried to cover all the various corner cases they thought up... and it ended up being too complicated and self-contradictory in real world use.

      Kind of like that episode in RoboCop where the OCP committee replaces his original four directives with several hundred: https://www.youtube.com/watch?v=Yr1lgfqygio


It's Californian culture shining through. I don't think they realize the rest of the world dislikes this vacuous flattery.

For sure. These days, if I want feedback on some writing I've done, I tell it I paid someone else to do the work and that I need help evaluating what they did well. Cuts out a lot of bullshit.
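Here's a hedged sketch of that framing trick using the OpenAI Python client (the model name, file path, and exact wording are placeholders, not a recommendation): presenting the draft as a third party's paid work invites critique instead of reflexive praise.

    # Ask for an evaluation of "someone else's" writing to dodge the flattery reflex.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    with open("draft.txt") as f:  # placeholder path to your own draft
        draft = f.read()

    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{
            "role": "user",
            "content": ("I paid a freelancer for the essay below and need to decide "
                        "whether to use it. List its genuine strengths and its "
                        "concrete weaknesses.\n\n" + draft),
        }],
    )
    print(response.choices[0].message.content)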