Comment by wongarsu

8 hours ago

My suspicion is that this agreeableness is an inherent issue with RLHF.

For a human taking a test, knowing what the grader wants to hear matters more than knowing the objectively correct answer, and with a bad grader there can be a big difference between the two. For humans that isn't catastrophic, because we can easily tell a testing environment from a real one and adjust our behavior accordingly. When asking for the answer to a question it's not unusual to hear "The real answer is X, but on a test just write Y".

Now LLMs have the same issue during RLHF. The specifics are obviously different, with humans being sentient and LLMs being trained by backpropagation, but from a high-level view the LLM is still trained to give the answer the human feedback rewards, which is not always the objectively correct one. And because a large number of humans are involved, the LLM has to guess what any given human wants to hear from the only information it has: the prompt. Since we actively don't want the LLM to behave differently in training and in deployment, you get this teacher-pleasing behavior all the time.

So maybe it's not completely inherent to RLHF, but rather to RLHF where the person making the query is the same person scoring the answer, or where the two are closely aligned. But that's true of all the "crowd-sourced" RLHF where regular users get two answers to their question and choose the better one.
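
To make the "choose the better one" point concrete, here's a minimal sketch of the pairwise reward-model objective that RLHF-style setups commonly use (a Bradley-Terry style loss; the function name, tensor values, and toy usage are just illustrative, not any particular lab's implementation). The thing to notice is that the only training signal is which answer the rater preferred, never whether either answer is correct:

```python
# Minimal sketch of the pairwise-preference signal used in RLHF reward modelling.
# The reward model only ever sees "which answer the rater preferred",
# not "which answer is objectively correct".

import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: push the reward of the rater-preferred
    answer above the reward of the answer the rater passed over."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: scores a hypothetical reward model assigned to two answers
# for the same prompt, where the rater picked the first one each time.
r_chosen = torch.tensor([1.3, 0.2])    # answers the rater preferred
r_rejected = torch.tensor([0.9, 0.6])  # answers the rater rejected
print(preference_loss(r_chosen, r_rejected).item())
```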

It's not even that. Only a kernel of the LLM's training uses RLHF; the rest is self-supervised training on a corpus, with a few test questions mixed in.

Because it still cannot reason about the veracity of sources, much less try things out empirically, the algorithm has no idea what makes an answer correct...

It does not even understand fiction, and tends to return sci-fi answers to technical questions every now and then.

I hadn't thought of it like that, but it makes sense. The LLMs are essentially bred by selecting the ones that give the 'best' answers (the best fit to what the grader expects), which isn't always the 'right' answer. A parallel might be media feed algorithms, which are bred to give recommendations with the most 'engagement' rather than the most 'entertainment'.