Comment by mountainriver

2 months ago

FlowRL is one: it learns to match the full distribution of rewards rather than just optimizing toward a single maximum.
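
To make the contrast concrete, here is a toy sketch. It is not FlowRL's actual training objective, just an illustration of "match the reward distribution" versus "chase the single best answer". The function names, the example rewards, and the temperature `beta` are all assumptions made up for this example.

```python
import math

def reward_matching_target(rewards, beta=1.0):
    """Distribution proportional to exp(beta * reward) over the candidates.

    Assumption for illustration: this softmax form is a common way to turn
    rewards into a target distribution; beta is a hypothetical temperature.
    """
    m = max(rewards)  # subtract the max for numerical stability
    weights = [math.exp(beta * (r - m)) for r in rewards]
    z = sum(weights)
    return [w / z for w in weights]

def reward_maximizing_target(rewards):
    """Degenerate distribution: all mass on the highest-reward candidate
    (the first one, if there is a tie)."""
    best = max(range(len(rewards)), key=lambda i: rewards[i])
    return [1.0 if i == best else 0.0 for i in range(len(rewards))]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return sum(-p * math.log(p) for p in probs if p > 0)

# Four hypothetical candidate answers; two of them are nearly equally good.
rewards = [1.0, 0.9, 0.2, 0.0]

matched = reward_matching_target(rewards, beta=2.0)
greedy = reward_maximizing_target(rewards)

print("matched:", [round(p, 3) for p in matched], "entropy:", round(entropy(matched), 3))
print("greedy: ", greedy, "entropy:", round(entropy(greedy), 3))
```

The distribution-matching target keeps meaningful probability on the second-best answer, while the maximizing target collapses onto one mode and has zero entropy. That collapse is the thing a distribution-matching approach tries to avoid.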
