Comment by nearbuy

1 day ago

If you didn't forget about RLHF, then your comment is oddly pedantic, confusing, and misleading. "Correct and satisfying answers" is roughly the loss function for RLHF, assuming the humans favor satisfying answers. I'm using "loss function" loosely here, as you yourself do: gesturing at what the loss function is meant to do rather than formally describing an actual function. The comment you responded to didn't say this was the only loss function during all stages of training, just that "When your loss function is X", then Y happens.

You could have just acknowledged that they are roughly correct about RLHF and then brought up the issues caused by pretraining.

> And I doubt RLHF gets rid of this ability.

The commenter you were replying to is worried that RLHF causes lying.