Comment by visarga
5 hours ago
No, they do RLVR (reinforcement learning with verifiable rewards) like everyone else. And probably use claude data too, with human in the loop and tool feedback.
5 hours ago
No, they do RLVR (reinforcement learning with verifiable rewards) like everyone else. And probably use claude data too, with human in the loop and tool feedback.
No comments yet
Contribute on Hacker News ↗