Comment by janalsncm
2 months ago
I think a lot of people in the ML community were excited for Noam Brown to lead the o-series at OpenAI because, intuitively, a lot of reasoning problems are highly nonlinear, i.e., they have a tree-like structure, so some kind of Monte Carlo tree search (MCTS) would work well. o1/o3 don’t seem to use this, and DeepSeek explicitly mentioned difficulties training such a model.
However, I think this is coming. DeepSeek mentioned it was hard to learn a value model for MCTS from scratch, but this doesn’t mean we couldn’t seed it with some annotated data.
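To make the "MCTS plus a value model" idea concrete, here is a minimal sketch of AlphaZero-style search on a hypothetical toy problem: choose four steps from 0-3 that sum to 7, with reward only on the final answer. The leaf of each simulation is scored by a value function instead of a random rollout. `toy_value_model` is a hand-written stand-in (it just checks whether the target is still reachable), not a trained network and not what DeepSeek or OpenAI used. In the setup discussed above it would be a learned model, possibly seeded with annotated data. All names here are made up for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple

State = Tuple[int, ...]  # partial "reasoning chain": the steps chosen so far

# Hypothetical toy problem: choose DEPTH steps from ACTIONS whose sum hits TARGET.
ACTIONS = (0, 1, 2, 3)
DEPTH = 4
TARGET = 7


def is_terminal(state: State) -> bool:
    return len(state) == DEPTH


def reward(state: State) -> float:
    # Sparse outcome reward, like checking only the final answer.
    return 1.0 if sum(state) == TARGET else 0.0


def toy_value_model(state: State) -> float:
    # Hypothetical stand-in for a learned value model: 1.0 if TARGET is still
    # reachable from this partial state, else 0.0. A real one would be trained.
    remaining = TARGET - sum(state)
    steps_left = DEPTH - len(state)
    return 1.0 if 0 <= remaining <= steps_left * max(ACTIONS) else 0.0


@dataclass
class Node:
    state: State
    parent: Optional["Node"] = None
    children: Dict[int, "Node"] = field(default_factory=dict)
    visits: int = 0
    value_sum: float = 0.0

    def q(self) -> float:
        return self.value_sum / self.visits if self.visits else 0.0


def uct_select(node: Node, c: float) -> Node:
    # UCT: mean value plus exploration bonus; unvisited children are tried first.
    log_n = math.log(node.visits)

    def score(child: Node) -> float:
        if child.visits == 0:
            return float("inf")
        return child.q() + c * math.sqrt(log_n / child.visits)

    return max(node.children.values(), key=score)


def mcts_search(state: State, value_fn: Callable[[State], float],
                n_sims: int = 200, c: float = 1.4) -> int:
    root = Node(state)
    for _ in range(n_sims):
        node = root
        # 1. Selection: walk down through already-expanded nodes.
        while node.children:
            node = uct_select(node, c)
        # 2. Expansion + evaluation: score the leaf with the value model (no rollout).
        if is_terminal(node.state):
            v = reward(node.state)
        else:
            for a in ACTIONS:
                node.children[a] = Node(node.state + (a,), parent=node)
            v = value_fn(node.state)
        # 3. Backpropagation: add the estimate to every node on the path.
        while node is not None:
            node.visits += 1
            node.value_sum += v
            node = node.parent
    # Commit to the most-visited action, as AlphaZero-style search does.
    return max(root.children, key=lambda a: root.children[a].visits)


def solve(value_fn: Callable[[State], float], n_sims: int = 200) -> State:
    # Run a fresh search at each step and append the chosen action.
    state: State = ()
    while not is_terminal(state):
        state = state + (mcts_search(state, value_fn, n_sims),)
    return state


if __name__ == "__main__":
    plan = solve(toy_value_model)
    print(plan, "sum =", sum(plan), "reward =", reward(plan))
```

Because the toy value model here is exact, the search has an easy time. Swapping in a noisy estimate is a quick way to see how much the whole thing leans on value quality, which is where the difficulty of learning a value model comes in.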
"DeepSeek explicitly mentioned difficulties training such a model." Do you have a reference for us to check?
Section 4.2: Unsuccessful attempts
https://arxiv.org/pdf/2501.12948