Comment by codelion
1 year ago
That’s the hardest part: figuring out the reward. For generic tasks it isn’t easy; in my implementation in optillm I use the LLM itself to generate a score based on the MCTS trajectory. But that is not as good as having a well-defined reward, say for a coding or logic problem. Maybe they trained a better reward model.
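A minimal sketch of that idea, using an LLM as a judge to score an MCTS trajectory. The prompt, function names, and the stub judge are illustrative assumptions, not optillm's actual code:

```python
import re

def trajectory_reward(trajectory, llm_judge):
    """Score an MCTS trajectory in [0, 1] by asking an LLM to rate it.

    `llm_judge` is any callable mapping a prompt string to a completion
    string; a hypothetical stand-in for a real model call.
    """
    prompt = (
        "Rate the quality of this reasoning trajectory on a scale of 0-10.\n"
        "Reply with a single number.\n\n"
        + "\n".join(f"Step {i + 1}: {step}" for i, step in enumerate(trajectory))
    )
    reply = llm_judge(prompt)
    # Parse the first number in the reply; default to 0 if none found.
    match = re.search(r"\d+(?:\.\d+)?", reply)
    score = float(match.group()) if match else 0.0
    return max(0.0, min(score / 10.0, 1.0))  # clamp to [0, 1]

# Stub judge for illustration; a real system would call the model here.
def stub_judge(prompt):
    return "7"

reward = trajectory_reward(["expand node A", "simulate to answer"], stub_judge)
print(reward)  # 0.7
```

The weakness the comment points out lives in `llm_judge`: the score is only as reliable as the model's self-assessment, whereas a coding or logic task could replace it with a verifier (tests pass, proof checks) that is well defined.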