Comment by Certhas
3 days ago
I didn't, hence the "first". It's clear that being good at next-token prediction forces the models to learn a lot, including how to give such answers. But it's not their loss function. With the right system prompt, they would presumably be just as capable of lying and insulting you. And I doubt RLHF gets rid of this ability.
If you didn't forget about the RLHF, your comment is oddly pedantic, confusing, and misleading. "Correct and satisfying answers" is roughly the loss function for RLHF, assuming the human raters favor satisfying answers. That uses "loss function" loosely, as you yourself do: gesturing at what the loss function is meant to reward rather than formally specifying an actual function. The comment you responded to didn't claim this was the only loss function across all stages of training, just that "when your loss function is X", then Y happens.
You could have just acknowledged they are roughly correct about RLHF, but brought up issues caused by pretraining.
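For what it's worth, the usual RLHF recipe does make "humans preferred this answer" into an explicit training signal: the reward model is typically fit with a Bradley-Terry pairwise preference loss over human-ranked answer pairs. A minimal sketch in Python (the function name and the example scores are illustrative, not from any particular implementation):

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss used to train RLHF reward models:
    -log(sigmoid(r_chosen - r_rejected)). Near 0 when the reward model
    scores the human-preferred answer much higher; large when it
    prefers the rejected answer instead."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# If annotators favor correct and satisfying answers, minimizing this
# loss pushes the reward model (and, via RL, the policy) toward them.
agree = preference_loss(2.0, -1.0)     # reward model matches the human ranking
disagree = preference_loss(-1.0, 2.0)  # reward model inverts it
```

So "correct and satisfying answers is roughly the loss function" is a fair gloss of this stage, even if the pretraining objective is plain next-token prediction.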
> And I doubt RLHF gets rid of this ability.
The commenter you were replying to is worried that RLHF causes lying.