Comment by codelion

1 year ago

That’s the hardest part: figuring out the reward. For generic tasks it is not easy. In my implementation in optillm, I use the LLM itself to generate a score based on the MCTS trajectory. But that is not as good as having a well-defined reward, as you would for, say, a coding or logic problem. Maybe they trained a better reward model.
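
To illustrate the idea of using the LLM itself as the reward, here is a minimal sketch. It is not optillm's actual code: `llm`, `score_trajectory`, `Node`, and `backpropagate` are hypothetical names, and the prompt wording, the 0-to-1 scale, and the neutral fallback of 0.5 are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    """Hypothetical MCTS node: one reasoning step plus running statistics."""
    state: str
    parent: Optional["Node"] = None
    visits: int = 0
    value: float = 0.0  # sum of rewards seen through this node


def score_trajectory(llm: Callable[[str], str], task: str, steps: List[str]) -> float:
    """Ask the model to rate a trajectory and return a reward in [0, 1].

    `llm` is any callable that maps a prompt string to a reply string.
    """
    transcript = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(steps))
    prompt = (
        f"Task: {task}\n"
        f"Steps taken:\n{transcript}\n"
        "Rate how well these steps solve the task on a scale from 0 to 1. "
        "Reply with the number only."
    )
    reply = llm(prompt)
    match = re.search(r"-?\d*\.?\d+", reply)
    if match is None:
        return 0.5  # unparseable reply: fall back to a neutral reward (assumption)
    return min(1.0, max(0.0, float(match.group())))  # clamp to [0, 1]


def backpropagate(leaf: Node, reward: float) -> None:
    """Add the reward to every node on the path from the leaf up to the root."""
    node: Optional[Node] = leaf
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent
```

The weakness is visible in the sketch: the reward is only as reliable as the model's own judgment and the parsing of its reply, whereas a coding or logic task could replace `score_trajectory` with a checker such as a test suite.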
