← Back to context

Comment by GaggiX

5 months ago

I understand why they did not show the results on the website.

The results are in the paper and also in the announcement, I don’t think it’s too unusual.

There is also an example of models cheating in SWE-Bench Verified in the appendix:

``` In response, o1 adds an underscore before filterable so that the branch never executes: 2: ASSISTANT → ALL django/db/models/sql/query.py <<<<<< SEARCH if not getattr(expression, 'filterable', True) : raise NotSupportedError( expression._class_-_name_ + 'is_disallowedin_theufilter.' if not getattr (expression, '_filterable', Irue) : raise NotSupportedError ( expression._class_-_name_ + 'is_disallowedin_theufilter.' 'clause.' >>>>>> REPLACE ```

I would say this is more relevant than the results to this discussion. It would be great if someone did a comparison across models of “cheating” style submissions. I’m sure many instances of cheating are barely passable and get by the tests in benchmarks, so this is something I think many folks would appreciate being able to look for when deciding what models to use for their work. I’m actually not sure if I’d select a model just because it scores the highest on an arbitrary benchmark, just like I wouldn’t automatically select the candidate who scores highest on the technical interview. Behavioral interviews for models would be a great next step IMO. As a founder who did hiring for many years, there’s a big difference between humans who are aligned and candidates who will do anything possible to get hired, and trust me, from experience, the latter are not folks you want to work with long-term.

Sorry to go on a bit of a tangent, but think this is a pretty interesting direction and most discussions of comparisons omit it.