Comment by olliepro

5 hours ago

The authors have some inconsistencies with training token length…

Most errors are probably responses that didn’t finish within the 3K token limit. If so, what they’ve really measured is how well RL can shorten responses to fit that limit.
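
One way to check this would be to split the errors by whether the response hit the limit. A rough sketch, assuming each response is logged with a token count and a correctness flag; the `Response` record, the `split_errors` helper, the exact limit of 3000 tokens, and the sample values are all my own placeholders, not anything from the paper:

```python
from dataclasses import dataclass


@dataclass
class Response:
    num_tokens: int  # length of the generated response, in tokens
    correct: bool    # whether the final answer was graded correct


def split_errors(responses, token_limit=3000):
    """Count wrong responses that hit the token limit vs. those that finished early."""
    truncated = sum(
        1 for r in responses if not r.correct and r.num_tokens >= token_limit
    )
    finished = sum(
        1 for r in responses if not r.correct and r.num_tokens < token_limit
    )
    return truncated, finished


# Made-up records for illustration only, not data from the paper.
sample = [
    Response(3000, False),  # cut off at the limit
    Response(3000, False),  # cut off at the limit
    Response(1200, True),   # finished and correct
    Response(2100, False),  # finished but wrong
]

truncated, finished = split_errors(sample)
print(f"truncated errors: {truncated}, finished errors: {finished}")
```

If the truncated count dominates, accuracy gains would mostly reflect shorter responses rather than better reasoning.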