
Comment by anotherpaulg

5 months ago

Using up to 32k thinking tokens, Sonnet 3.7 set SOTA with a 64.9% score.

  65% Sonnet 3.7, 32k thinking
  64% R1+Sonnet 3.5
  62% o1 high
  60% Sonnet 3.7, no thinking
  60% o3-mini high
  57% R1
  52% Sonnet 3.5

It's clear that progress is incremental at this point. At the same time, Anthropic and OpenAI are bleeding money.

It's unclear to me how they'll shift to making money while providing almost no enhanced value.

  • Yudkowsky just mentioned that even if LLM progress stopped right here, right now, there are already enough fundamental economic changes in motion to give us a really weird decade. Even with no moat, if the labs are in any way placed to capture a little of the value they've created, they could make high multiples of their investors' money.

    • Like what economic changes? You can make a case that people are 10% more productive in very specific fields (programming, perhaps consultancy, etc.). That's not really an earthquake; the internet/web was probably far more significant.


    • It's an echo chamber.

      It is - what? - a fifth anniversary of "the world will be a completely different place in 6 months due to AI advancement"?

      "Sam Altman believes AI will change the world" - of course he does, what else is he supposed to say?


    • Yep totally agree. It will also depend who captures the most eyeballs.

      ChatGPT is already my default first place to check something, where it was Google for the previous 20+ years.


    • With no moat, they aren't placed to capture much value. Moats are what stop market competition from driving prices down to the zero-economic-profit level, and that's before even counting competition from free products built by people who aren't trying to support themselves in the market you're selling into, which can make even the zero-economic-profit price untenable.


    • Oh really? What are these changes supposed to look like? Who will pay up, essentially? I don't really see it, aside from the m$ business case of offering AI as a cover for violating privacy even more harshly in order to sell ads better.

Paul, I saw in the notes that using Claude with thinking mode requires yml config updates -- any pointers here? I was parsing some commits, and I couldn't tell if you only added architect support through openrouter. Thanks!
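
For context, what I was expecting is something like the .aider.model.settings.yml entry sketched below -- the model name, edit format, and thinking budget are just my guesses from the Anthropic docs, not anything I could confirm in the commits:

  # .aider.model.settings.yml -- rough sketch, values are guesses, not confirmed
  - name: anthropic/claude-3-7-sonnet-20250219
    edit_format: diff
    use_repo_map: true
    extra_params:
      max_tokens: 64000
      # extra_params get passed through to the API call, so the Anthropic
      # extended-thinking settings would presumably go here
      thinking:
        type: enabled
        budget_tokens: 32000

(If that shape is right, the 32k in the leaderboard entry presumably maps to budget_tokens.)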

Also, that was for $36.83, compared to o1's $186.50.

  • But also, against that $36.83, DeepSeek R1 + claude-3-5 comes in at $13.29, and for the latter "Percent using correct edit format" is 100% vs 97.8% for 3.7.

    edit: it would be interesting to see how the DeepSeek R1 + claude-3-7 combo performs (a sketch of that kind of setup is below).
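
    Something along these lines, assuming the conf-file keys mirror the CLI flags; the model names are placeholders I haven't tested:

      # .aider.conf.yml -- rough sketch of an architect-mode pairing
      architect: true                                       # main model proposes, editor model applies edits
      model: deepseek/deepseek-reasoner                     # R1 as the architect/reasoner
      editor-model: anthropic/claude-3-7-sonnet-20250219    # 3.7 as the editor (placeholder name)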

How does it stack up against Grok3? I've seen some discussion that Grok3 is good for coding.

  • It isn't available over the API yet, as far as I know, so it can't really be tested independently.

    • The comparisons I saw were manual, I think, so it makes sense that nobody has run a whole suite; these were just some basic prompts that showed the difference in how the produced output ran.

  • Pro tip: It's hard to trust Twitter for opinions on Grok. The thumb is very clearly on the scale. I have personally seen very few positive opinions of Grok outside of Twitter.

    • I agree with you, and I hate to say this, but I saw them on LinkedIn. One purportedly used the same prompts to make a "pacman like" game, and the results from Grok 3 were, assuming the post is on the up and up, at least better looking than o3-mini-high's.

    • I thought Grok 2 was pretty bad, but Grok 3 is actually quite good. I'm mostly impressed by the speed of answering. But Claude is still the king of code.