Comment by mountainriver

3 months ago

FlowRL is one. It learns the full distribution of rewards rather than just optimizing toward a single maximum.
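Here is a toy sketch of that distinction. It assumes a fixed set of candidate answers with scalar rewards and a hypothetical temperature `beta`. This is not FlowRL's actual training objective. It only contrasts two target distributions: pure reward maximization puts all probability on the single best answer, while a reward-matching target spreads probability in proportion to exp(beta * reward).

```python
import math

def max_reward_target(rewards):
    """Greedy target: all probability mass on the highest-reward candidate(s)."""
    best = max(rewards)
    winners = [i for i, r in enumerate(rewards) if r == best]
    return [1.0 / len(winners) if i in winners else 0.0 for i in range(len(rewards))]

def reward_matching_target(rewards, beta=1.0):
    """Toy distribution-matching target: p(i) proportional to exp(beta * r_i).

    `beta` is a hypothetical temperature chosen for illustration only.
    """
    m = max(rewards)
    # Subtract the max before exponentiating for numerical stability.
    weights = [math.exp(beta * (r - m)) for r in rewards]
    z = sum(weights)  # normalizing constant (partition function)
    return [w / z for w in weights]

rewards = [1.0, 0.9, 0.2]
print(max_reward_target(rewards))
print([round(p, 3) for p in reward_matching_target(rewards, beta=5.0)])
```

With `beta=5.0`, the answer with reward 0.9 keeps roughly a third of the probability instead of dropping to zero, which is the kind of diversity that a "single maximum" objective discards.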
