Comment by olliepro
5 hours ago
The authors have some inconsistencies with training token length…
Most errors are probably responses that didn’t finish before the 3K-token limit. In effect, they’ve measured how well RL can shorten responses to fit within that limit.