Comment by marlburrow
18 hours ago
The "private benchmarks" suggestion comes up every time, but I think there's a more interesting axis: benchmarks built on top of already-public, already-stable test instruments. SWE-bench is fundamentally a corpus that lives on GitHub — once it ships, it leaks into training data automatically. Benchmarks built on contested qualitative instruments (psych tests, opinion surveys) have a different contamination profile because the correct answer doesn't exist in the training corpus to memorize — only the question does.
That doesn't help for measuring coding ability specifically (you fundamentally need a code-correctness oracle), but for capability axes where the "answer" is a stated position rather than a verifiable fact, public + stable can still be useful. The SWE-bench problem isn't really "public", it's "public + has a fixed correct answer".
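The question-leak vs. answer-leak distinction can be made concrete with a toy sketch (all names and data here are hypothetical, not from any real benchmark): a SWE-bench-style item carries a fixed correct answer that can leak alongside the question, while a survey-style item has no canonical answer string to memorize.

```python
# Toy sketch of the two leakage modes. A coding benchmark item can leak
# both its question AND its fixed correct answer into a training corpus;
# an opinion-survey item leaks only the question, because no canonical
# answer string exists.

def leaks_question(corpus: str, item: dict) -> bool:
    """True if the prompt text appears verbatim in the training corpus."""
    return item["question"] in corpus

def leaks_answer(corpus: str, item: dict) -> bool:
    """True only if a fixed correct answer exists AND appears in the corpus."""
    answer = item.get("answer")  # None for instruments with no ground truth
    return answer is not None and answer in corpus

# Hypothetical corpus containing both a leaked issue+patch and a survey question.
corpus = "... repo issue text ... def fix(): return 42 ... survey question text ..."

swe_item = {"question": "repo issue text", "answer": "def fix(): return 42"}
survey_item = {"question": "survey question text", "answer": None}

print(leaks_question(corpus, swe_item), leaks_answer(corpus, swe_item))        # True True
print(leaks_question(corpus, survey_item), leaks_answer(corpus, survey_item))  # True False
```

Both items are "public + stable" in the sense above; only the first is "public + has a fixed correct answer", which is the combination that lets memorization masquerade as capability.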