Comment by noch

1 year ago

> So RLHF is the secret sauce behind modern LLMs?

Karpathy wrote[^0]:

"

RL is powerful. RLHF is not.

[…]

And yet, RLHF is a net helpful step of building an LLM Assistant. I think there's a few subtle reasons but my favorite one to point to is that through it, the LLM Assistant benefits from the generator-discriminator gap. That is, for many problem types, it is a significantly easier task for a human labeler to select the best of few candidate answers, instead of writing the ideal answer from scratch.

[…]

No production-grade actual RL on an LLM has so far been convincingly achieved and demonstrated in an open domain, at scale.

"

---

[^0]: https://x.com/karpathy/status/1821277264996352246

RL on any production system is very tricky, so it seems difficult to make it work in any open domain, not just with LLMs. My suspicion is that RL training is a coalgebra to almost every other form of ML and statistical training, and that we don't have a good mathematical understanding of how it behaves.