Comment by mtrifonov
8 hours ago
Still downstream of the actual issue. The benchmarks measure capability and the bottleneck stopped being capability a while ago.
What you actually want to measure on these models is what they can SEE in production: context shape, retrieval quality, tool use, the ability to compose state across turns. None of that is in SWE-bench, because SWE-bench is shaped like a one-shot problem set, and frontier coding work isn't shaped like that anymore.
Even a perfectly contamination-free benchmark would mostly test the wrong axis. The model is already at the level of a human grad student on isolated problems. The leverage is in how it operates inside a larger system, and that's closer to a matter of taste and preference, which is nearly impossible to measure objectively.