← Back to context

Comment by Tiberium

5 months ago

The extremely interesting part is that 3.5 Sonnet is above o1 on this benchmark, which again shows that 3.5 Sonnet is a very special model that's best for real world tasks and not some one-shot scripts or math. And the weirdest part is that they tested the 20240620 snapshot which is objectively worse on code than the newer 20241022 (so-called v2).

I still find 3.5 Sonnet the best for my coding tasks (better than o1, o3-mini, and R1). The other models might be trying to game system and fine tune the models for the benchmarks.

  • Would love to know just how overfit a lot of them are on these benchmarks

3.5 Sonnet is definitely my goto for straightforward tasks in github copilot. It seems much more effective due to its lack of verbosity and focus on completing the task rather than explaining it. Really helps in the new agent mode too.

Occasionally I switch out to one of the other models, usually GPT 4o, when I can't define the task as well and need to see additional analysis or get ideas.

  • Interesting, any reason to not use reasoning models? Is there anything 4o seems better at with respect to coding?

    I typically use o1 or o3-mini, but I am seeing that they just released an agent mode and, honestly, I think it depends on what you use it for. I don’t think the agent mode is going to be useful for me. Typically my tasks are quite pedestrian, like I don’t know how to use a certain regex format, I need a python script to print list of directories, etc.

    My main issue (which is not really covered in the paper) is that it’s not clear what models are most aligned to my work; by this I mean not lazy and willing to put in the required work, not incentivized to cheat, etc. So I’ll use them for the very small tasks (like regex) or the very big ones (like planning), but still don’t use them for the “medium” tasks that you’d give an intern. It’s not clear to me how they will operate totally unsupervised, and I think more benchmarking for that would be incredible.

    Excited to see that hopefully change this year though!

  • Co-pilot is offering 'Preview' version of it, Has anyone spotted any difference using preview vs non-preview versions ?

I understand why they did not show the results on the website.

  • The results are in the paper and also in the announcement, I don’t think it’s too unusual.

    There is also an example of models cheating in SWE-Bench Verified in the appendix:

    ``` In response, o1 adds an underscore before filterable so that the branch never executes: 2: ASSISTANT → ALL django/db/models/sql/query.py <<<<<< SEARCH if not getattr(expression, 'filterable', True) : raise NotSupportedError( expression._class_-_name_ + 'is_disallowedin_theufilter.' if not getattr (expression, '_filterable', Irue) : raise NotSupportedError ( expression._class_-_name_ + 'is_disallowedin_theufilter.' 'clause.' >>>>>> REPLACE ```

    I would say this is more relevant than the results to this discussion. It would be great if someone did a comparison across models of “cheating” style submissions. I’m sure many instances of cheating are barely passable and get by the tests in benchmarks, so this is something I think many folks would appreciate being able to look for when deciding what models to use for their work. I’m actually not sure if I’d select a model just because it scores the highest on an arbitrary benchmark, just like I wouldn’t automatically select the candidate who scores highest on the technical interview. Behavioral interviews for models would be a great next step IMO. As a founder who did hiring for many years, there’s a big difference between humans who are aligned and candidates who will do anything possible to get hired, and trust me, from experience, the latter are not folks you want to work with long-term.

    Sorry to go on a bit of a tangent, but think this is a pretty interesting direction and most discussions of comparisons omit it.

I think Sonnet doesn't have web search integrated, and I suspect because of this I receive more hallucinated lib APIs compared to gpt.