Comment by storus
3 days ago
RL is extremely brittle, and it's often difficult to make it converge. Even Stanford folks admit that. Are there any solutions for this?
FlowRL is one. It learns to match the full distribution of rewards rather than just optimizing toward a single maximum.
Thanks, that looks very promising!