Comment by jumploops

5 months ago

> "[..] in developing our reasoning models, we’ve optimized somewhat less for math and computer science competition problems, and instead shifted focus towards real-world tasks that better reflect how businesses actually use LLMs."

This is good news. OpenAI seems to be aiming towards "the smartest model," but in practice, LLMs are used primarily as learning aids, data transformers, and code writers.

Balancing "intelligence" with "get shit done" seems to be the sweet spot, and afaict one of the reasons the current crop of developer tools (Cursor, Windsurf, etc.) prefer Claude 3.5 Sonnet over 4o.

30 comments

eschluntz 5 months ago

Thanks! We all dogfood Claude every day to do our own work here, and solving our own pain points is more exciting to us than abstract benchmarks.

Getting things done requires a lot of book smarts, but also a lot of "street smarts": knowing when to answer quickly, when to double back, etc.

jasonjmcghee 5 months ago
Just want to say nice job and keep it up. Thrilled to start playing with 3.7.
In general, benchmarks seem to be very misleading in my experience, and I still prefer Sonnet 3.5 for _nearly_ every use case, except massive text tasks, for which I use Gemini 2.0 Pro with the 2M token context window.
- jasonjmcghee 5 months ago
  
  An update: "code" is very good. Just did a ~4 hour task in about an hour. It cost $3, which is more than I usually spend in an hour, but very worth it.
- martinald 5 months ago
  
  I find the webdev arena tends to match my experience with models much more closely than other benchmarks: https://web.lmarena.ai/leaderboard. Excited to see how 3.7 performs!
LouisSayers 5 months ago
Could you tell us a bit about the coding tools you use and how you go about interacting with Claude?
- catherinewu 5 months ago
  
  We find that Claude is really good at test-driven development, so we often ask Claude to write tests first and then ask Claude to iterate against the tests.
  

crowcroft 5 months ago

Sometimes I wonder if there is overfitting towards benchmarks (DeepSeek is the worst for this to me).

Claude is pretty consistently the chat I go back to where the responses subjectively seem better to me, regardless of where the model actually lands in benchmarks.

ben_w 5 months ago

> Sometimes I wonder if there is overfitting towards benchmarks
There absolutely is, even when it isn't intended.
The difference between what the model is fitting to and the reality it is used on is essentially every problem in AI, from paperclipping to hallucination, from unlawful output to simple classification errors.
(Ok, not every problem, there's also sample efficiency, and…)
FergusArgyll 5 months ago

Ya, Claude crushes the smell test

bicx 5 months ago

Claude 3.5 has been fantastic in Windsurf. However, it does cost credits. DeepSeek V3 is now available in Windsurf at zero credit cost, which was a major shift for the company. Great to have variable options either way.

I’d highly recommend anyone check out Windsurf’s Cascade feature for agentic-like code writing and exploration. It helped save me many hours in understanding new codebases and tracing data flows.

throwup238 5 months ago
DeepSeek’s models are vastly overhyped (FWIW I have access to them via Kagi, Windsurf, and Cursor - I regularly run the same tests on all three). I don’t think it matters that V3 is free when even R1 with its extra compute budget is inferior to Claude 3.5 by a large margin - at least in my experience in both bog standard React/Svelte frontend code and more complex C++/Qt components. After only half an hour of using Claude 3.7, I find the code output is superior and the thinking output is in a completely different universe (YMMV and caveat emptor).
For example, DeepSeek’s models almost always smash together C++ headers and code files even with Qt, which is an absolutely egregious error due to the meta-object compiler preprocessor step. The MOC has been around for at least 15 years and is all over the training data so there’s no excuse.
- SkyPuncher 5 months ago
  
  I've found DeepSeek's models are within a stone's throw of Claude. Given the massive price difference, I often use DeepSeek.
  That being said, when cost isn't a factor Claude remains my winner for coding.
- rubymamis 5 months ago
  
  Hey there! I’m a fellow Qt developer and I really like your takes. Would you like to connect? My socials are on my profile.
  
- bionhoward 5 months ago
  
  The big difference is DeepSeek R1 has a permissive license whereas Claude has a nightmare “closed output” customer noncompete license which makes it unusable for work unless you accept not competing with your intelligence supplier, which sounds dumb
  
- tonyhart7 5 months ago
  
  I've seen people switch from Claude to another model, notably DeepSeek, due to cost. Tbh I think it still depends on the data the model was trained on.
ai-christianson 5 months ago

I'm working on an OSS agent called RA.Aid and 3.7 is anecdotally a huge improvement.
About to push a new release that makes it the default.
It costs money but if you're writing code to make money, it's totally worth it.
newgo 5 months ago

How is it possible that DeepSeek V3 would be free? It costs a lot of money to host models.