
Comment by anotherpaulg

5 months ago

Using up to 32k thinking tokens, Sonnet 3.7 set SOTA with a 64.9% score.

  65% Sonnet 3.7, 32k thinking
  64% R1+Sonnet 3.5
  62% o1 high
  60% Sonnet 3.7, no thinking
  60% o3-mini high
  57% R1
  52% Sonnet 3.5

It's clear that progress is incremental at this point. At the same time, Anthropic and OpenAI are bleeding money.

It's unclear to me how they'll shift to making money while providing almost no enhanced value.

  • Yudkowsky just mentioned that even if LLM progress stopped right here, right now, there are already enough fundamental economic changes in motion to give us a really weird decade. Even with no moat, if the labs are in any way placed to capture a little of the value they've created, they could make high multiples of their investors' money.

    • Like what economic changes? You can make a case that people are 10% more productive in very specific fields (programming, perhaps consultancy, etc.). That's not really an earthquake; the internet/web was probably far more significant.


    • It's an echo chamber.

      It is - what? - a fifth anniversary of "the world will be a completely different place in 6 months due to AI advancement"?

      "Sam Altman believes AI will change the world" - of course he does, what else is he supposed to say?


    • Yep totally agree. It will also depend who captures the most eyeballs.

      ChatGPT is already my default first place to check something, where it was Google for the previous 20+ years.


    • With no moat, they aren't placed to capture much value. Moats are what stop market competition from driving prices down to the zero-economic-profit level, and that's before even counting competition from free products built by people who aren't trying to support themselves in the market you're selling into, which can make even the zero-economic-profit price untenable.


    • Oh really? What are these changes supposed to look like? Who will pay up, essentially? I don't really see it, aside from the m$ business case of offering AI as a cover for violating privacy even more harshly in order to sell ads better.

Paul, I saw in the notes that using Claude with thinking mode requires yml config updates -- any pointers here? I was parsing some commits, and I couldn't tell if you only added architect support through openrouter. Thanks!
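
For context, what I was expecting is something like the .aider.model.settings.yml entry sketched below -- the model name, edit format, and thinking budget are just my guesses from the Anthropic docs, not anything I could confirm in the commits:

  # .aider.model.settings.yml -- rough sketch, values are guesses, not confirmed
  - name: anthropic/claude-3-7-sonnet-20250219
    edit_format: diff
    use_repo_map: true
    extra_params:
      max_tokens: 64000
      # extra_params get passed through to the API call, so the Anthropic
      # extended-thinking settings would presumably go here
      thinking:
        type: enabled
        budget_tokens: 32000

(If that shape is right, the 32k in the leaderboard entry presumably maps to budget_tokens.)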

Also, that was for $36.83, compared to o1's $186.50.

  • But also, against that $36.83, DeepSeek R1 + claude-3-5 comes in at $13.29, and for the latter "Percent using correct edit format" is 100% vs 97.8% for 3.7.

    edit: it would be interesting to see how the DeepSeek R1 + claude-3-7 combo performs (a sketch of that kind of setup is below).
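
    Something along these lines, assuming the conf-file keys mirror the CLI flags; the model names are placeholders I haven't tested:

      # .aider.conf.yml -- rough sketch of an architect-mode pairing
      architect: true                                       # main model proposes, editor model applies edits
      model: deepseek/deepseek-reasoner                     # R1 as the architect/reasoner
      editor-model: anthropic/claude-3-7-sonnet-20250219    # 3.7 as the editor (placeholder name)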

How does it stack up against Grok3? I've seen some discussion that Grok3 is good for coding.

  • It isn't available over the API yet, as far as I know, so it can't really be tested independently.

    • The comparisons I saw were manual, I think, so it makes sense that nobody has run a whole suite; these were just some basic prompts that showed the difference in how the produced output ran.

  • Pro tip: It's hard to trust Twitter for opinions on Grok. The thumb is very clearly on the scale. I have personally seen very few positive opinions of Grok outside of Twitter.

    • I agree with you, and I hate to say this, but I saw them on LinkedIn. One purportedly used the same prompts to make a "pacman like" game, and the results from Grok 3 were, assuming the post is on the up and up, at least better looking than o3-mini-high's.

    • I thought Grok 2 was pretty bad, but Grok 3 is actually quite good. I'm mostly impressed by the speed of answering. But Claude is still the king of code.