Comment by Davidzheng
5 days ago
I don't think it's much less exciting if they ran 10,000 attempts in parallel? It implies an ability to discern when a proof is correct and rigorous (which o3 can't do consistently), and it also means that outputting the full proof is within the model's capabilities, even if rare.
The whole point of RL is that if you can get it to work 0.01% of the time, you can get it to work 100% of the time.
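To put rough numbers on the parallel-sampling point, here is a minimal sketch. It assumes each attempt succeeds independently with the same probability and that a correct proof is always recognized when it appears; neither assumption is something the lab has confirmed. Pairing the 0.01% rate with 10,000 attempts is only an illustration, and `p_at_least_one` is a made-up helper name.

```python
def p_at_least_one(p_single: float, n_samples: int) -> float:
    """Chance that at least one of n independent attempts succeeds.

    Assumption: attempts are independent and share the same
    per-attempt success probability p_single.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if n_samples < 0:
        raise ValueError("n_samples must be non-negative")
    # Complement of "every attempt fails".
    return 1.0 - (1.0 - p_single) ** n_samples


if __name__ == "__main__":
    # Illustrative numbers only: 0.01% per attempt, 10,000 attempts.
    print(f"{p_at_least_one(0.0001, 10_000):.3f}")
```

Under those assumptions the chance of getting at least one correct proof is about 1 - 1/e, so a rare capability plus a reliable way to pick out the correct sample already gets you a usable result most of the time.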