Comment by EnPissant

19 hours ago

> 1. SWE-bench Verified is now saturated at 93.9% (congrats Anthropic), but anyone who hasn't reached that number yet still has more room for growth.

But if some or all players are bench-maxing it, then it becomes a much less useful metric for comparison.

Also, this doesn't address what OpenAI says about the test suite disallowing valid solutions.