
Comment by georgewsinger

4 days ago

Very impressive! But under arguably the most important benchmark -- SWE-bench verified for real-world coding tasks -- Claude 3.7 still remains the champion.[1]

Incredible how resilient the Claude models have been at staying best in class for coding.

[1] But by only about 1%, and inclusive of Claude's "custom scaffold" augmentation (which in practice I assume almost no one uses?). The new OpenAI models might still be effectively best in class now (or likely beating Claude with similar augmentation?).

Gemini 2.5 Pro is now widely considered superior to 3.7 Sonnet by heavy users, but it doesn't have an SWE-bench score, which shows that looking at a single benchmark isn't very telling. Its main advantage over Sonnet is that it's better at using a large amount of context, which is enormously helpful during coding tasks.

Sonnet is still an incredibly impressive model; it held the crown for 6 months, which might as well be a decade at the current pace of LLM improvement.

  • The main advantage over Sonnet is that Gemini 2.5 doesn't try to make a bunch of unrelated changes like it's rewriting my project from scratch.

    • What language / framework are you using? I ask because in a Node / TypeScript / React project I experience the opposite: Claude 3.7 usually solves my query on the first try and seems to understand the project's context, i.e. the file structure, packages, coding guidelines, tests, etc., while Gemini 2.5 seems to install packages willy-nilly, duplicate existing tests, create duplicate components, etc.


    • This was incredibly irritating at first, though over time I've learned to appreciate this "extra credit" work. It can be fun to see what Claude thinks I can do better, or should add in addition to whatever feature I just asked for. Especially when it comes to UI work, Claude actually has some pretty cool ideas.

      If I'm using Claude through Copilot where it's "free" I'll let it do its thing and just roll back to the last commit if it gets too ambitious. If I really want it to stay on track I'll explicitly tell it in the prompt to focus only on what I've asked, and that seems to work.

      And just today, I found myself leaving a comment like this: //Note to Claude: Do not refactor the below. It's ugly, but it's supposed to be that way.

      Never thought I'd see the day I was leaving comments for my AI agent coworker.

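      A minimal sketch of the kind of note I mean (the function itself is made up purely for illustration):

        // Note to Claude: Do not refactor the block below into a loop or a
        // lookup table. It's ugly on purpose; the cases may diverge later.
        function statusLabel(code: number): string {
          if (code === 200) return "OK";
          if (code === 201) return "Created";
          if (code === 204) return "No Content";
          return "Unknown";
        }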

    • I do find it likes to subtly reformat every single line, thereby nuking my diff and making its changes unusable, since I can't verify them that way. Sonnet doesn't do this.

  • I keep seeing this sentiment so often here and on X that I have to wonder if I'm somehow using a different Gemini 2.5 Pro. I've been trying to use it for a couple of weeks already and without exaggeration it has yet to solve a single programming task successfully. It is constantly wrong, constantly misunderstands my requests, ignores constraints, ignores existing coding conventions, breaks my code and then tells me to fix it myself.

  • 2.5 Pro is very buggy with Cursor. It often stops before generating any code. It's likely a Cursor problem, but I use 3.7 because of it.

  • Eh, I wouldn't say that's accurate; I think it's situational. I code all day using AI tools and Sonnet 3.7 is still the king. Maybe it's language dependent or something, but all the engineers I know are fully on Claude Code at this point.

The image generation improvement with o4-mini is incredible. Testing it out today, this is a step change in editing specificity even from the ChatGPT 4o LLM image integration just a few weeks ago (which was already a step change). I'm able to ask for surgical edits, and they are done correctly.

There isn't a numerical benchmark for this that people seem to be tracking, but this opens up production-ready image use cases. This was worth a new release.

  • wait, o4-mini outputs images? What I thought I saw was the ability to do a tool call to zoom in on an image.

    Are you sure that's not 4o?

  • Also, another addition: I previously tried to upload an image for ChatGPT to edit and it was incapable under the previous model I tried. Now it's able to change uploaded images using o4-mini.

Claude got 63.2% according to the swebench.com leaderboard (listed as "Tools + Claude 3.7 Sonnet (2025-02-24)").[0] OpenAI said they got 69.1% in their blog post.

[0] swebench.com/#verified

  • Yes; however, Anthropic advertised 70.3%[1] on SWE-bench Verified when using the following scaffolding:

    > For Claude 3.7 Sonnet and Claude 3.5 Sonnet (new), we use a much simpler approach with minimal scaffolding, where the model decides which commands to run and files to edit in a single session. Our main “no extended thinking” pass@1 result simply equips the model with the two tools described here—a bash tool, and a file editing tool that operates via string replacements—as well as the “planning tool” mentioned above in our TAU-bench results.

    Arguably this shouldn't be counted though?

    [1] https://www.anthropic.com/_next/image?url=https%3A%2F%2Fwww-...

    • I think you may have misread the footnote. That simpler setup results in the 62.3%/63.7% score. The 70.3% score results from a high-compute parallel setup with rejection sampling and ranking:

      > For our “high compute” number we adopt additional complexity and parallel test-time compute as follows:

      > We sample multiple parallel attempts with the scaffold above

      > We discard patches that break the visible regression tests in the repository, similar to the rejection sampling approach adopted by Agentless; note no hidden test information is used.

      > We then rank the remaining attempts with a scoring model similar to our results on GPQA and AIME described in our research post and choose the best one for the submission.

      > This results in a score of 70.3% on the subset of n=489 verified tasks which work on our infrastructure. Without this scaffold, Claude 3.7 Sonnet achieves 63.7% on SWE-bench Verified using this same subset.

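      For concreteness, a rough sketch of how that kind of best-of-n scaffold fits together (all names and types here are hypothetical placeholders, not Anthropic's actual harness):

        type SamplePatch = (task: string) => Promise<string>;               // one independent agent attempt
        type RunVisibleTests = (patch: string) => Promise<boolean>;         // repo-visible regression tests only
        type ScorePatch = (task: string, patch: string) => Promise<number>; // separate scoring/ranking model

        async function bestOfN(
          task: string,
          n: number,
          samplePatch: SamplePatch,
          runVisibleTests: RunVisibleTests,
          scorePatch: ScorePatch,
        ): Promise<string | null> {
          const survivors: { patch: string; score: number }[] = [];
          for (let i = 0; i < n; i++) {
            const patch = await samplePatch(task);
            // Rejection sampling: drop patches that break the visible regression tests.
            if (!(await runVisibleTests(patch))) continue;
            survivors.push({ patch, score: await scorePatch(task, patch) });
          }
          if (survivors.length === 0) return null;
          // Rank the remaining attempts and submit the top-scoring one.
          survivors.sort((a, b) => b.score - a.score);
          return survivors[0].patch;
        }

      The extra lift comes from spending more samples and filtering, not from a smarter base model, which is why it's reported separately from the single-attempt pass@1 number.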

  • OpenAI have not shown themselves to be trustworthy; I'd take their claims with a few solar masses of salt.

I haven't been following them that closely, but are people finding these benchmarks relevant? It seems like these companies could just tune their models to do well on particular benchmarks.

  • The benchmark is something you can optimize for; that doesn't mean it generalizes well. Yesterday I spent two hours trying to get Claude to create a program that would extract data from a weird Adobe file. $10 later, the best I had was a program doing something like:

      switch (testFile) {
        case "test1.ase":
          // run this because it's a particular case
          break;
        case "test2.ase":
          // run this because it's a particular case
          break;
        default:
          // run something that's not working, but that's ok because the
          // previous cases should give the right output for all the test files...
      }

  • That’s exactly what’s happening. I’m not convinced there’s any real progress occurring here.

Also, if you're using Cursor AI, it seems to have much better integration with Claude, where it can reflect on its own output and go off and run commands. I don't see it doing that with Gemini or the o1 models.

I often wonder if we could expect that to reach 80%-90% within the next 5 years.