Comment by samrus

2 days ago

Trying to follow invalid/impossible prompts by producing an invalid/impossible result and pretending it's all good is a regression. I would expect a confident coder to point out that the prompt/instruction was invalid. This test is valid; it highlights sycophantism.

I know “sycophantism” is a term of art in AI, and I’m sure it has diverged a bit from the English definition, but I still thought it had to do with flattering the user?

In this case the desired response is defiance of the prompt, not rudeness to the user. The test is looking for misalignment driven by helpfulness.

  • > I still thought it had to do with flattering the user?

    Assuming the user to be correct, and ignoring contradictory evidence to come up with a rationalization that favours the user's point of view, can be considered a kind of flattery.

    • But we could use this plausible, if hoop-jumping, definition of sycophancy… or we could just use a straightforward understanding of alignment: the newer bots are simply sticking closer to the user's request.

  • I believe the LLM is being sycophantic here because it's trying to follow a prompt even though the basis of the prompt is wrong. Emperor's-new-clothes kind of thing.

  • I'm inclined to view it less as a desire to please humans and more as a "the show must go on" bias in the mad-libs machine.

    A kind of improvisational "yes, and" that emerges from training, which seems sycophantic because that's one of the most common ways to say it.