Comment by anotherpaulg

5 months ago

Claude 3.7 Sonnet scored 60.4% on the aider polyglot leaderboard [0], WITHOUT USING THINKING.

Tied for 3rd place with o3-mini-high. Sonnet 3.7 has the highest non-thinking score, taking that title from Sonnet 3.5.

Aider 0.75.0 is out with support for 3.7 Sonnet [1].

Thinking support and thinking benchmark results coming soon.

[0] https://aider.chat/docs/leaderboards/

[1] https://aider.chat/HISTORY.html#aider-v0750

> 225 coding exercises from Exercism

Has there been any effort taken to reduce data leakage of this test set? Sounds like these exercises were available on the internet pre-2023, so they'll probably be included in the training data for any modern model, no?

  • I try not to let perfect be the enemy of good. All benchmarks have limitations.

    The Exercism problems have proven to be very effective at measuring an LLM's ability to modify existing code. I receive a lot of feedback that the aider benchmarks correlate strongly with people's "vibes" on model coding skill. I agree. The scores have felt quite aligned with my hands-on experience coding with most of the top models over the last 18+ months.

    To be clear, the purpose of the benchmark is to help me quantitatively assess and improve aider and make it more effective. But it's also turned out to be a great way to measure the coding skill of LLMs.

    • > The Exercism problems have proven to be very effective at measuring an LLM's ability to modify existing code

      The Aider Polyglot website also states that the benchmark "...asks the LLM to edit source files to complete 225 coding exercises".

      However, looking at the actual tests [0], it doesn't seem to be about editing code bases; it's rather just solving simple programming exercises? What am I missing?

      [0] https://github.com/Aider-AI/polyglot-benchmark

    • >I try not to let perfect be the enemy of good. All benchmarks have limitations.

      Overfitting is one of the fundamental issues to contend with when trying to figure out whether any type of model is useful at all. If your leaderboard corresponds to vibes and that is your target, you could just have a vibes leaderboard.

    • That's my perception as well. Most of the time, most of the devs I know, including myself, are not really creating novelty with the code itself, but with the product. (Sometimes even the product is not novel, just a similar, enhanced version of existing products.)

      If the resulting code is not trying to be excessively clever or creative, this is actually a good thing in my book.

      The novelty and creativity should come from the product itself, especially from the users'/customers' perspective. Some people are too attached to LLM leaderboards being about novelty. I want reliable results whenever I give the instructions, whether that's the code, or the specs built into a spec file after throwing some ideas into prompts.

  • I like to make up my own tests; that way you know it is actually thinking.

    Tests that require thinking about the physical world are the most revealing.

    My new favourite is:

    You have 2 minutes to cool down a cup of coffee to the lowest temp you can.

    You have two options:

    1. Add cold milk immediately, then let it sit for 2 mins.

    2. Let it sit for 2 mins, then add cold milk.

    Which one cools the coffee to the lowest temperature and why?

    Phrased this way, without any help, all but the thinking models get it wrong.

    • No need for thinking: that question can be found discussed and explained many times online, and has almost certainly been part of the training data.

    • The fact that the answer is interesting makes me suspect that it's not a good test for thinking. I remember reading the explanation for the answer somewhere on the internet years ago, and it's stayed with me ever since. It's interesting enough that it's probably been written about multiple times in multiple places. So I think it would probably stay with a transformer trained on large volumes of data from the internet too.

      I think a better test of thinking is to provide detail about something so mundane and esoteric that no one would have ever thought to communicate it to other people for entertainment, and then ask it a question about that pile of boring details.

    • I’m not sure how much this tells me about a model’s coding ability though.

      It might correlate to design level thinking but it also might not.

    • I have another easy one which thinking models get wrong:

      "Anhentafel numbers start with you as 1. To find the Ahhentafel number of someone's father, double it. To find the Ahnentafel number of someone's mother, double it and add one.

      Men pass on X chromosome DNA to their daughters, but none to their sons. Women pass on X chromosome DNA to both their sons and daughters.

      List the Ahnentafel numbers of the closest 20 ancestors a man may have inherited X DNA from."

      For smaller models, it's probably fair to change the question to something like: "Could you have inherited X chromosome DNA from your ancestor with Ahnentafel number 33? Does the answer to that question depend on whether you are a man or a woman?" They still fail.
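
      For reference, here's a quick brute-force check of the expected answer. This is a hypothetical sketch (the function name and structure are mine); the inheritance rules are taken straight from the puzzle:

        from collections import deque

        def x_dna_ancestors(limit=20, subject_is_male=True):
            # Ahnentafel: person n's father is 2n, mother is 2n + 1.
            result = []
            queue = deque([(1, subject_is_male)])  # (Ahnentafel number, is_male)
            while queue and len(result) < limit:
                n, is_male = queue.popleft()
                if is_male:
                    # a man inherits X DNA only from his mother
                    parents = [(2 * n + 1, False)]
                else:
                    # a woman inherits X DNA from both parents
                    parents = [(2 * n, True), (2 * n + 1, False)]
                for anc in parents:  # breadth-first = closest generations first
                    result.append(anc[0])
                    queue.append(anc)
            return result[:limit]

        print(x_dna_ancestors())
        # [3, 6, 7, 13, 14, 15, 26, 27, 29, 30, 31, 53, 54, 55, 58, 59, 61, 62, 63, 106]
        # Ancestor 33 never qualifies for either sex: the path 1 -> 2 -> 4 -> 8 -> 16 -> 33
        # passes through father-to-son links, which block X inheritance.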

    • I asked this to QwQ and it started writing equations (Newton's law) and arrived at T_2 < T_1, then said this was counterintuitive, wrote more equations and arrived at the same result, then started explaining why this is indeed the case rather than what intuition suggests, and concluded with the right answer.

      It is the only model I gave this to that actually approached it by writing math. Usually I am not that impressed with reasoning models, but this was quite fun to watch.

    • Yes, absolutely this! We're working on these problems at FlyShirley for our pilot training tool. My go-to is: I'm facing 160 degrees and want to face north. What's the quickest way to turn, and by how much?

      For small models and when attention is "taken up", these sorts of questions really send a model for a loop. Agreed - especially noticeable with small reasoning models.
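
      For what it's worth, the expected answer is easy to check mechanically. A minimal sketch (the function name and convention are mine; headings in compass degrees, 0 = north):

        def shortest_turn(heading, target=0.0):
            # size of the clockwise (right) turn needed, in [0, 360)
            diff = (target - heading) % 360
            return ("right", diff) if diff <= 180 else ("left", 360 - diff)

        print(shortest_turn(160))  # ('left', 160.0): turn left 160 degrees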

    • How do you know that this test doesn't already exist somewhere on the internet and that LLMs were trained on it?

    • > Phrased this way without any help, all but the thinking models get it wrong

      I C&P'd it into Claude 3.7 with thinking, and it gave the correct answer (which I'm pretty sure is #2). Including the CoT, where it actually does math (which I haven't checked), and final response.

      # THINKING

      Let's analyze the two options.

      Option 1: Add cold milk immediately, then let it sit for 2 mins.

      Option 2: Let it sit for 2 mins, then add cold milk.

      To determine which option leads to the lowest final temperature, I need to consider the physics of cooling.

      First, let's understand how cooling works in these scenarios:

      1. When a hot liquid sits, it cools according to Newton's Law of Cooling, which states that the rate of heat loss is proportional to the temperature difference between the object and its surroundings.

      - The rate of cooling is faster when the temperature difference is greater

      - As the liquid cools, the rate of cooling decreases

      2. When we add cold milk to hot coffee, we get an immediate temperature drop due to mixing liquids of different temperatures. The final temperature after mixing can be calculated using the principle of thermal equilibrium.

      Now, let's analyze each option:

      Option 1: Add cold milk immediately, then let it sit for 2 mins

      - The coffee temperature drops immediately when milk is added

      - Then this coffee+milk mixture cools for 2 minutes

      - The cooling rate will be based on the temperature difference between the coffee+milk mixture and room temperature

      Option 2: Let it sit for 2 mins, then add cold milk

      - The black coffee cools for 2 minutes

      - Then milk is added, causing an immediate drop in temperature

      - The cooling rate during the 2 minutes will be based on the temperature difference between the black coffee and room temperature

      The key insight is that hot coffee cools faster than warm coffee because the rate of cooling is proportional to the temperature difference. Since black coffee starts at a higher temperature than coffee with milk, it will lose more heat energy during the 2-minute cooling period.

      Let's make this more concrete with some theoretical numbers:

      - Let's say black coffee starts at 90°C

      - Room temperature is 20°C

      - Cold milk is at 5°C and makes up 10% of the final volume

      Option 1:

      - Adding milk immediately might bring the temperature to about 81.5°C (90 × 0.9 + 5 × 0.1)

      - This mixture then cools for 2 minutes from 81.5°C to some lower temperature, let's call it T1

      Option 2:

      - The black coffee cools for 2 minutes from 90°C to some temperature, let's call it T2

      - Then we add milk, which further reduces the temperature to 0.9 × T2 + 0.1 × 5

      The question is: which is lower, T1 or 0.9 × T2 + 0.1 × 5?

      According to Newton's Law of Cooling, the black coffee in Option 2 will lose more heat during the 2 minutes because it starts at a higher temperature and thus has a greater temperature difference with the surroundings. This means the drop from 90°C to T2 will be greater than the drop from 81.5°C to T1.

      Let's think about this mathematically:

      According to Newton's Law of Cooling: T(t) = Ts + (T0 - Ts) × e^(-kt) Where:

      - T(t) is the temperature at time t

      - Ts is the surrounding temperature

      - T0 is the initial temperature

      - k is the cooling constant

      For Option 1:

      T1 = 20 + (81.5 - 20) × e^(-k × 2) = 20 + 61.5 × e^(-2k)

      For Option 2, before adding milk:

      T2 = 20 + (90 - 20) × e^(-k × 2) = 20 + 70 × e^(-2k)

      After adding milk in Option 2, the temperature would be:

      0.9 × T2 + 0.1 × 5 = 0.9 × (20 + 70 × e^(-2k)) + 0.5

      = 18 + 63 × e^(-2k) + 0.5

      = 18.5 + 63 × e^(-2k)

      Now we compare:

      T1 = 20 + 61.5 × e^(-2k)

      Final temp in Option 2 = 18.5 + 63 × e^(-2k)

      For any positive value of k (which it must be for cooling), e^(-2k) is between 0 and 1.

      If e^(-2k) = 1 (no cooling at all):

      T1 = 20 + 61.5 = 81.5

      Final temp in Option 2 = 18.5 + 63 = 81.5 (They would be the same)

      If e^(-2k) = 0 (instant cooling to room temperature): T1 = 20, final temp in Option 2 = 18.5 (Option 2 would be better)

      For values in between, we need to compare 20 + 61.5 × e^(-2k) vs 18.5 + 63 × e^(-2k)

      Subtracting: (T1 - Final temp in Option 2) = 1.5 - 1.5 × e^(-2k)

      This will be positive (meaning T1 is higher) when e^(-2k) < 1, which is true for all positive k.

      So, mathematically, Option 2 (letting it sit, then adding milk) results in a lower final temperature.

      This makes physical sense because:

      1. The hotter black coffee loses more heat during the 2 minutes than the cooler coffee+milk mixture would

      2. The cold milk provides the same cooling effect whether added at the beginning or end

      Therefore, Option 2 (letting the coffee sit for 2 minutes, then adding cold milk) will result in the lowest final temperature.

      # ANSWER

      # Coffee Cooling Analysis

      ## Option 2 wins: Let it sit for 2 mins, then add cold milk

      This will give you the lowest final temperature, and here's why:

      ### The physics behind it:

      1. *Newton's Law of Cooling*: The rate at which something cools is proportional to the temperature difference between the object and its surroundings.

      2. *Key insight*: Hotter objects cool faster than cooler ones because the temperature gap with the environment is larger.

      ### Comparing the options:

      *Option 1 (Add milk first, then wait):*

      - Milk immediately lowers coffee temperature

      - The coffee+milk mixture cools for 2 minutes, but at a slower rate because it's already cooler

      *Option 2 (Wait 2 mins, then add milk):*

      - Black coffee starts hotter and cools faster during the 2 minutes

      - More total heat is lost during the waiting period

      - Then milk provides the same temperature drop at the end

      The black coffee's higher initial temperature difference with the room causes it to lose more total heat energy during the waiting period, making Option 2 the more effective cooling strategy.
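
      Since the CoT math above went unchecked: here's a minimal numerical sanity check of that algebra, using the same illustrative numbers (90°C coffee, 20°C room, 5°C milk at 10% of the volume) plus an assumed cooling constant k = 0.2/min:

        import math

        def final_temp(milk_first, T0=90.0, T_milk=5.0, T_room=20.0,
                       frac=0.1, k=0.2, minutes=2.0):
            mix = lambda T: (1 - frac) * T + frac * T_milk            # stir in milk
            cool = lambda T: T_room + (T - T_room) * math.exp(-k * minutes)
            return cool(mix(T0)) if milk_first else mix(cool(T0))

        print(final_temp(milk_first=True))   # ~61.2 C
        print(final_temp(milk_first=False))  # ~60.7 C: waiting first wins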

  • They leak the second they are used on a model behind an API, don't they?

    • As far as I can tell, the only way of comparing two models that cannot be easily gamed is having both in open-weights form and then running them against a benchmark that was created after both models were.

Using up to 32k thinking tokens, Sonnet 3.7 set SOTA with a 64.9% score.

  65% Sonnet 3.7, 32k thinking
  64% R1+Sonnet 3.5
  62% o1 high
  60% Sonnet 3.7, no thinking
  60% o3-mini high
  57% R1
  52% Sonnet 3.5

  • It's clear that progress is incremental at this point. At the same time, Anthropic and OpenAI are bleeding money.

    It's unclear to me how they'll shift to making money while providing almost no enhanced value.

    • Yudkowsky just mentioned that even if LLM progress stopped right here, right now, there are enough fundamental economic changes to provide us a really weird decade. Even with no moat, if the labs are in any way placed to capture a little of the value they've created, they could make high multiples of their investors' money.

  • Also for $36.83 compared to o1's $186.50

    • But that $36.83 compares to $13.29 for DeepSeek R1 + claude-3-5, and for the latter the "Percent using correct edit format" is 100%, vs 97.8% for 3.7.

      Edit: it would be interesting to see how the DeepSeek R1 + claude-3-7 combo performs.

Interesting that the "correct diff format" score went from 99.6% with Claude 3.5 to 93.3% for Claude 3.7. My experience with using claude-code was that it consistently required several tries to get the right diff. Hopefully all that will improve as they get things ironed out.
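For context, aider's "diff" edit format (the thing this score measures) asks the model to emit search/replace blocks, roughly like the following; the file and its contents here are made up for illustration:

  greeting.py
  <<<<<<< SEARCH
  def greet():
      print("hi")
  =======
  def greet():
      print("hello")
  >>>>>>> REPLACE

A reply that garbles the markers or the SEARCH text can't be applied cleanly, which is presumably what drags the score down.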

  • Reasoning models pretty reliably seem to do worse at exacting output formats/structured outputs. So far with Aider it has been an effective strategy to employ o1 to "think" about the issue at hand, and have Sonnet implement. Interested to try various approaches with 3.7 at different levels of reasoning effort.
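
    That split is what aider's architect mode automates. Assuming current flag names (treat as illustrative; check `aider --help`), the pairing looks something like:

      # the main model plans the change; the editor model writes the edits
      aider --architect --model o1 --editor-model claude-3-5-sonnet-20241022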

    • It's funny because I have also found myself doing this exact thing with R1 + Sonnet 3.5 recently. Windsurf allows you to do a chat-mode exchange with one model and then switch to another to implement. The reasoning models all seem pretty poorly suited to agentic workflows, but work well when paired with Claude.

  • 3.7 completed a lot more than 3.5; without seeing the actual results, we can't tell if there were any regressions in the edit format among the previously completed tasks.

  • That's a file-context problem, because you use Cursor or Cline or some other crap context maker. Try Clood.

    Unless Anthropic is under high usage (I just watch the incident reports), I one-shot features regularly.

    At a high skill level. Not front end. Back end C# in a small but great framework that has poor documentation. Not just endpoints but full-on task queues.

    So really, it's a context problem. You're just not laser-focusing your context.

    Try this: set up a context with the exact files needed. Sure, AI "should" do that, but it doesn't. Especially not Cursor or Cline. Then try.

    Hell, try it with Clood after I update it for 3.7. I bet you, if you Clood-file it, you get one-shots.

    I have a long history of Clood being a commit in my projects, and it's a Clood one-shot.

Hi Paul, been following the aider project for about a year now to develop an understanding of how to build SWE agents.

I was at the AI Engineering Summit in NYC last week and met an (extremely senior) staff ai engineer doing somewhat unbelievable things with aider. Shocking things tbh.

Is there a good way to share stories about real-world aider projects like this with you directly (if I can get approval from him)? I'm not sure posting on a public forum is appropriate, but I think you would be really interested to hear how people are using this tool at the edge.

Thanks for all the work on aider, my favorite AI tool.

  • It really is best in slot. Much of that is owed to git, which has a particular synergy with a hallucination-prone but correctable system.

    • I like Aider but I've turned off auto-commit. I just can't seem to let the AI actually commit code for me. Do you regularly let Aider commit for you? How much do you review the code written by it?

Have you tried Claude 3.7 + DeepSeek as the architect? Seeing as "DeepSeek R1 + claude-3-5-sonnet-20241022" is the second-place option, "DeepSeek R1 + claude-3-7" would hopefully be the highest-ranking choice so far?

  • It looks like Sonnet 3.7 (extended thinking) would be a better architect than R1.

    I'll be trying out Sonnet 3.7 extended thinking + Sonnet 3.5 or Flash 2.0, which I assume would be at the top of the leaderboard.

    • Given that 3.5 and 3.7 cost the same, it doesn't make sense to use 3.5 here.

      I'd like to see that benchmark, but R1 + 3.7 should be cheaper than 3.7T + 3.7.

Nice!

Could we please get benchmarks for architect / DeepSeek R1 + claude-3-7-20250219?

To compare perf and price with Sonnet-3.7-thinking.

And yet, "DeepSeek R1 + claude-3-5-sonnet-20241022" scores 64% on the same benchmark 30% cheaper.

It's amazing what DeepSeek is putting on the table while being fully open source.

I like that we're just saying they're thinking now. John Searle would be furious.

(I kid, I know what is meant by that.)