Comment by noch

1 year ago

> So RLHF is the secret sauce behind modern LLMs?

Karpathy wrote[^0]:

"

RL is powerful. RLHF is not.

[…]

And yet, RLHF is a net helpful step of building an LLM Assistant. I think there's a few subtle reasons but my favorite one to point to is that through it, the LLM Assistant benefits from the generator-discriminator gap. That is, for many problem types, it is a significantly easier task for a human labeler to select the best of few candidate answers, instead of writing the ideal answer from scratch.

[…]

No production-grade actual RL on an LLM has so far been convincingly achieved and demonstrated in an open domain, at scale.

"

---

[^0]: https://x.com/karpathy/status/1821277264996352246

RL on any production system is very tricky, so it seems difficult to make it work in any open domain, not just with LLMs. My suspicion is that RL training is a coalgebra to almost every other form of ML and statistical training, and that we don't have a good mathematical understanding of how it behaves.