Comment by codelion

1 year ago

I have also spent some time on 2) and implemented several approaches in this open-source optimising LLM proxy: https://github.com/codelion/optillm

In my experience it does work quite well, but we probably need different techniques for different tasks.

Maybe 1 is actually what you just suggested, an RL approach to select the strategy for 2. Thank you for implementing optillm and working out all the various strategy options; it's a really neat reference for understanding this space.
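To make that concrete, something like the sketch below is what I had in mind: the simplest RL-ish setup, an epsilon-greedy bandit that picks a strategy per request and updates its estimates from whatever reward signal you get back. The strategy names here are placeholders, not optillm's actual identifiers.

```python
import random
from collections import defaultdict

# Placeholder strategy names, loosely inspired by optillm's options.
STRATEGIES = ["mcts", "best_of_n", "self_consistency", "moa"]

class EpsilonGreedySelector:
    """Pick an inference-time strategy per task and learn from observed rewards."""

    def __init__(self, strategies, epsilon=0.1):
        self.strategies = strategies
        self.epsilon = epsilon
        self.counts = defaultdict(int)
        self.values = defaultdict(float)  # running mean reward per strategy

    def select(self):
        # Explore with probability epsilon, otherwise exploit the best mean so far.
        if random.random() < self.epsilon:
            return random.choice(self.strategies)
        return max(self.strategies, key=lambda s: self.values[s])

    def update(self, strategy, reward):
        # Incremental update of the running mean reward for the chosen strategy.
        self.counts[strategy] += 1
        n = self.counts[strategy]
        self.values[strategy] += (reward - self.values[strategy]) / n
```

A contextual version (conditioning on task type) would obviously be better, but even this toy version shows the loop: select, run the strategy, score the output, update.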

One thing I’m very curious about is how they get a score to use in the RL. In well-defined games it’s easy to understand, but for LLM output, how does one rate the result for use in an RL setup?

  • That’s the hardest part, figuring out the reward. For generic tasks it is not easy; in my implementation in optillm I am using the LLM itself to generate a score based on the MCTS trajectory. But that is not as good as having a well-defined reward, say for a coding or logic problem. Maybe they trained a better reward model.
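Roughly the shape of it, as a simplified sketch (this is not the actual optillm code; the model name and prompt are placeholders): the trajectory is flattened into a transcript and the LLM is asked for a single numeric rating, which then acts as the reward.

```python
import re
from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible endpoint works

def score_trajectory(trajectory: list[str], model: str = "gpt-4o-mini") -> float:
    """Ask the LLM to rate an MCTS trajectory (list of intermediate outputs) from 0 to 10."""
    transcript = "\n---\n".join(trajectory)
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "You are a strict evaluator. Rate the quality of the "
                        "following reasoning trajectory on a scale of 0 to 10. "
                        "Reply with the number only."},
            {"role": "user", "content": transcript},
        ],
        temperature=0.0,
    )
    text = response.choices[0].message.content.strip()
    match = re.search(r"\d+(\.\d+)?", text)
    # Normalise to [0, 1] so it can be used as an RL-style reward; default to 0 if unparsable.
    return min(max(float(match.group()) / 10.0, 0.0), 1.0) if match else 0.0
```

The obvious weakness is that the judge and the generator share the same blind spots, which is why a verifiable reward (unit tests, exact-match answers) works so much better when you can get one.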