Comment by gwd

3 months ago

> In my experience, LLMs have always had a tendency towards sycophancy

The very early ones (maybe GPT-3?) sure didn't. You'd show them they were wrong, and they'd say something implying that OK, maybe you were right, but they weren't so sure; or that their original mistake was somehow your fault.

Were those trained using RLHF? IIRC the earliest models were just using SFT for instruction following.

Like the GP said, I think this is fundamentally a problem of training on human preference feedback. You end up with a model that produces things that cater to human preferences, which (necessarily?) includes the degenerate case of sycophancy.
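
To make the "catering to preferences" point concrete, here's a rough sketch (not from anyone in this thread; PyTorch, names are mine) of the standard pairwise Bradley-Terry loss used to train an RLHF reward model: it's literally optimized to score whatever human raters preferred higher, so a policy tuned against it gets pulled toward agreeable, flattering answers along with genuinely better ones.

```python
# Sketch of the pairwise (Bradley-Terry) loss commonly used for RLHF reward
# models trained on human preference data. Shapes and values are illustrative.
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected), averaged over the batch.
    Minimized by pushing the reward of the human-preferred response
    above the reward of the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy batch: scalar rewards the reward model assigned to two candidate
# replies for each of three prompts (chosen = the one raters preferred).
reward_chosen = torch.tensor([1.2, 0.3, 2.0])
reward_rejected = torch.tensor([0.9, -0.1, 1.5])
print(preference_loss(reward_chosen, reward_rejected))  # ~0.51
```

Nothing in that objective distinguishes "the rater preferred this because it was correct" from "the rater preferred this because it agreed with them", which is the degenerate case I mean.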