Comment by XCSme
1 day ago
They mentioned on their release page that the Claude team noticed memorization of the SWE-bench test, so the test is actually in the training data.
Here: https://www.anthropic.com/news/claude-opus-4-7#:~:text=memor...
Any static benchmark older than 12-18 months is basically worthless, because its content will have spread all over the internet and found its way into the latest models' training sets.
Good luck arguing with SWE-bench purists.