Comment by noch
1 year ago
> So RLHF is the secret sauce behind modern LLMs?
Karpathy wrote[^0]:
"
RL is powerful. RLHF is not.
[…]
And yet, RLHF is a net helpful step of building an LLM Assistant. I think there's a few subtle reasons but my favorite one to point to is that through it, the LLM Assistant benefits from the generator-discriminator gap. That is, for many problem types, it is a significantly easier task for a human labeler to select the best of few candidate answers, instead of writing the ideal answer from scratch.
[…]
No production-grade actual RL on an LLM has so far been convincingly achieved and demonstrated in an open domain, at scale.
"
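
The generator-discriminator gap Karpathy mentions is what makes preference-based reward modeling viable: the labeler only ranks candidates, and the comparison alone supplies a training signal. A minimal sketch of the standard Bradley-Terry pairwise loss used in RLHF reward-model training (the function name and scores here are illustrative, not from any particular library):

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss: negative log-probability that the
    reward model scores the human-preferred answer above the rejected one."""
    p_chosen = 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))
    return -math.log(p_chosen)

# The labeler only picked the better of two candidate answers;
# the reward model is trained to widen the score margin between them.
loss = preference_loss(r_chosen=2.0, r_rejected=0.5)
```

The loss depends only on the score difference, which is exactly why "select the best of a few candidates" suffices: no ideal reference answer ever has to be written.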
---
RL on any production system is very tricky, so it seems difficult to make work in any open domain, not just with LLMs. My suspicion is that RL training behaves as a coalgebra to almost every other form of ML and statistical training, and we don't have a good mathematical understanding of how it behaves.