Comment by meroes
5 days ago
In the RLHF sphere you could tell some AI company/companies were targeting this because of how many IMO RLHF’ers they were hiring specifically. Given that, I don’t think it’s easy to say how much “progress” this really represents.
I doubt this is coming from RLHF - tweets from the lead researcher state that this result flows from a research breakthrough which enables RLVR on less verifiable domains.
Math RLHF already has verifiable ground truth/right vs wrong, so I don't see what this distinction really shows.
And AI changes so quickly that there is a breakthrough every week.
Call me cynical, but I think this is an RLHF/RLVR push in a narrow area--IMO was chosen as a target and they hired specifically to beat this "artificial" target.
RLHF means Reinforcement Learning from Human Feedback. The right/wrong ones are called RL or RLVR (Reinforcement Learning from Verifiable Rewards).
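The distinction is easier to see in code. A minimal sketch (purely illustrative, not any lab's actual pipeline; the `Answer:` marker convention is an assumption) of an RLVR-style reward: a program checks the answer against ground truth, with no human rater or learned preference model in the loop.

```python
# RLVR-style reward sketch: the reward is computed mechanically from a
# verifiable ground truth, unlike RLHF where a model trained on human
# preference labels scores the output.

def verifiable_math_reward(model_output: str, ground_truth: str) -> float:
    """Return 1.0 if the model's final answer matches ground truth, else 0.0.

    Assumes the model marks its final answer like 'Answer: 42'; that
    convention is a hypothetical choice for this illustration.
    """
    marker = "Answer:"
    if marker not in model_output:
        return 0.0
    answer = model_output.rsplit(marker, 1)[1].strip()
    return 1.0 if answer == ground_truth.strip() else 0.0

print(verifiable_math_reward("Let x = 6 * 7. Answer: 42", "42"))  # 1.0
print(verifiable_math_reward("Answer: 41", "42"))                 # 0.0
```

Because the check is programmatic, it only works where a right answer exists and can be extracted, which is why extending this style of training to "less verifiable domains" would be notable.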
They were hiring IMO winners because IMO winners tend to be good at working on AI, not because they had the people specifically to make the AI better at math.
Uh no. I’m a math RLHF’er. When I get hired, I work on math/logic up to masters level because those are my qualifications. Masters and PhD holders work on masters- and PhD-level material. And IMO winners work on IMO math.
Every skill and skill level is specifically assigned and hired in the RLHF world.
Sometimes the skill levels are fuzzier, but that’s usually very temporary.
And as has been said already, IMO is a specific skill that even PhD math holders aren’t universally trained for.