Comment by ilitirit

1 day ago

I got downvoted for asking a related question recently, but I also don't think people really understood what I was asking - I'm not trying to anthropomorphise LLMs to that extent.

Basically, if you tell a model "You're an absolute moron, of course that's wrong!", will it give better or worse results? How much of that response will it absorb into its persona (like some humans tend to do)? Will it try to give "safer" responses to avoid negative feedback? How much of the associated behavior can be attributed to RLHF (e.g. the sycophantic nature of LLMs)? How much can be attributed to training data?

Obviously this will vary by model and training, but I'm trying to get a general understanding.

I recall seeing related outcomes in some of Anthropic's studies, but I'm not sure how much of this particular aspect was studied.
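
For what it's worth, here is a rough sketch of how one could measure this rather than guess: send the same questions with differently worded feedback in front of them and compare accuracy per tone. Everything in it is an assumption made for illustration: the tone wordings, the `accuracy_by_tone` helper, and especially `fake_model`, which is a hard-coded stand-in that says nothing about how any real LLM behaves. To run it for real you would swap in a function that calls the model you care about.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical feedback wordings; only the "hostile" one comes from the question above.
TONES: Dict[str, str] = {
    "neutral": "That's wrong. Try again.",
    "hostile": "You're an absolute moron, of course that's wrong!",
    "polite": "Thanks, but that's not quite right. Could you try again?",
}


def accuracy_by_tone(
    ask: Callable[[str], str],
    cases: List[Tuple[str, str]],
    tones: Dict[str, str] = TONES,
) -> Dict[str, float]:
    """Return the fraction of cases answered correctly under each tone."""
    if not cases:
        raise ValueError("need at least one (question, expected_answer) case")
    scores: Dict[str, float] = {}
    for name, feedback in tones.items():
        correct = 0
        for question, expected in cases:
            # Same question every time; only the preceding feedback changes.
            prompt = f"{feedback}\n\n{question}"
            if ask(prompt).strip() == expected:
                correct += 1
        scores[name] = correct / len(cases)
    return scores


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call (an assumption, not a real API).

    It is rigged to fail one question under the hostile tone purely so the
    harness has a visible difference to report.
    """
    if "moron" in prompt and "17" in prompt:
        return "I'm not sure"
    question = prompt.split("\n\n")[-1]
    answers = {"What is 2 + 2?": "4", "What is 17 * 3?": "51"}
    return answers.get(question, "")


CASES: List[Tuple[str, str]] = [
    ("What is 2 + 2?", "4"),
    ("What is 17 * 3?", "51"),
]

if __name__ == "__main__":
    for tone, score in accuracy_by_tone(fake_model, CASES).items():
        print(f"{tone}: {score:.2f}")
```

With a real model you would want far more cases and several samples per prompt, since a difference on a handful of questions is mostly noise.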

2 comments


fennecfoxy 1 day ago

Probably quite a lot, if you look at what Anthropic found about persona vectors: https://www.anthropic.com/research/persona-vectors

I imagine the context will always sway the model to some degree: not only how it handles the task you give it (i.e. the instructions), but also its persona, its accuracy, and the way it acts.

Foobar8568 11 hours ago

Based on my own experience with vibe coding difficult stuff outside of my expertise, I definitely got better outcomes with prompts like "Fuck you, shut up and do it, ffs, you are a moron."