Comment by storus

2 months ago

RL is extremely brittle, it's often difficult to make it converge. Even Stanford folks admit that. Are there any solutions for this?

2 comments

storus

FlowRL is one, it’s learning the full distribution of rewards rather than just optimizing toward a single maximum