Comment by ainch
5 days ago
I doubt this is coming from RLHF - tweets from the lead researcher state that this result flows from a research breakthrough which enables RLVR on less verifiable domains.
Math RLHF already has verifiable ground truth/right vs wrong, so I don't see what this distinction really shows.
And AI changes so quickly that there is a breakthrough every week.
Call me cynical, but I think this is an RLHF/RLVR push in a narrow area: IMO was chosen as a target and they hired specifically to beat this "artificial" target.
RLHF means Reinforcement Learning from Human Feedback. The right/wrong ones are either called RL or RLVR (Reinforcement Learning from Verifiable Rewards).
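To make the distinction concrete, here is a minimal sketch of the two reward signals (the names and the `reward_model.score` interface are illustrative, not from any specific implementation): RLVR scores a completion with a programmatic check against ground truth, while RLHF scores it with a reward model learned from human preference data.

```python
def rlvr_reward(completion: str, ground_truth: str) -> float:
    """RLVR: a programmatic, verifiable check -- e.g. exact match on a
    math answer. No human labels are needed at RL time."""
    return 1.0 if completion.strip() == ground_truth.strip() else 0.0

def rlhf_reward(completion: str, reward_model) -> float:
    """RLHF: a model trained on human preference comparisons scores the
    completion; the signal is learned, not verified. `reward_model` is a
    hypothetical object with a `score` method."""
    return reward_model.score(completion)

# A math answer is checkable, so RLVR-style rewards apply directly:
assert rlvr_reward("42", "42") == 1.0
assert rlvr_reward("41", "42") == 0.0
```

The "breakthrough" claim in the thread is about extending the top function's style of checkable reward to domains where no such ground-truth comparison exists, which is exactly where the bottom function has traditionally been used.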