Comment by aspenmartin

3 days ago

> 1. LLMs are trained on human-quality data, so they will naturally learn to mimic our limitations. Their capabilities should saturate at human or maybe above-average human performance.

Why oh why is this such a commonly held belief? Getting around it is the entire point of RL in verifiable domains. It's the same idea behind a system like AlphaGo: human data is used only to bootstrap a starting point for RL, and RL then takes you to superhuman performance. I'm confused why people keep missing this. The burden of proof is on those who claim we will hit some sort of performance wall, because I know of no mechanism for that to happen in verifiable domains.
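The two-stage pipeline described above can be illustrated with a toy sketch (all names and numbers here are hypothetical, not from the thread): imitation of human demonstrations gets the agent to human-level play, and trial-and-error against a verifiable reward then pushes it past the action humans favored.

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical 3-action task with a verifiable reward signal.
# Human demonstrations mostly favor a suboptimal action ("b").
reward = {"a": 0.2, "b": 0.5, "c": 0.9}          # "c" is actually best
human_demos = ["b"] * 80 + ["a"] * 15 + ["c"] * 5

# Stage 1 (imitation): copy the action humans play most often.
imitation_choice = Counter(human_demos).most_common(1)[0][0]
assert imitation_choice == "b"                    # human-level, not optimal

# Stage 2 (RL): epsilon-greedy trial and error against the verifier,
# warm-started from the imitation policy.
q = {a: 0.0 for a in reward}                      # running value estimates
n = {a: 0 for a in reward}                        # per-action sample counts
q[imitation_choice] = reward[imitation_choice]
for _ in range(2000):
    if random.random() < 0.1:                     # explore occasionally
        a = random.choice(list(reward))
    else:                                         # otherwise exploit
        a = max(q, key=q.get)
    n[a] += 1
    r = reward[a] + random.gauss(0, 0.1)          # noisy but verifiable feedback
    q[a] += (r - q[a]) / n[a]                     # incremental mean update

best = max(q, key=q.get)
print(best)                                       # RL finds "c", beating the humans' "b"
```

Nothing in stage 2 depends on human data being optimal, which is the point: the verifier, not the demonstrations, sets the ceiling.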

I did mention RL as a valid counterargument in my comment.

I agree that in verifiable domains RL systems should be able to blow past human performance, and this may already be happening. There's a separate interesting question of how much RL improves performance in non-verifiable domains. I'm not taking a stance either way; I just think it's an interesting question.