Comment by eschluntz
5 months ago
Thanks! We all dogfood Claude every day to do our own work here, and solving our own pain points is more exciting to us than abstract benchmarks.
Getting things done requires a lot of book smarts, but also a lot of "street smarts" - knowing when to answer quickly, when to double back, etc.
Just want to say nice job and keep it up. Thrilled to start playing with 3.7.
In general, benchmarks seem to be very misleading in my experience, and I still prefer Sonnet 3.5 for _nearly_ every use case - except massive text tasks, for which I use Gemini 2.0 Pro with the 2M token context window.
An update: "code" is very good. Just did a ~4 hour task in about an hour. It cost $3, which is more than I usually spend in an hour, but very worth it.
I find the webdev arena tends to match my experience with models much more closely than other benchmarks: https://web.lmarena.ai/leaderboard. Excited to see how 3.7 performs!
Could you tell us a bit about the coding tools you use and how you go about interacting with Claude?
We find that Claude is really good at test-driven development, so we often ask Claude to write tests first and then ask Claude to iterate against the tests.
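(Aside: a minimal sketch of what that tests-first loop can look like, assuming a Python/pytest setup - the `slugify` function and file names here are hypothetical, purely for illustration, not anything described above. Step one is having Claude write failing tests; step two is having it implement and re-run `pytest` until they pass.)

```python
# --- tests/test_slugify.py: written first, before any implementation exists ---
from slugify import slugify


def test_lowercases_and_hyphenates():
    assert slugify("Hello World") == "hello-world"


def test_strips_punctuation():
    assert slugify("Rust & Go: a comparison!") == "rust-go-a-comparison"


def test_collapses_repeated_separators():
    assert slugify("  too   many---spaces  ") == "too-many-spaces"


# --- slugify.py: implemented second, iterating against the tests until green ---
import re


def slugify(text: str) -> str:
    """Lowercase, replace runs of non-alphanumerics with single hyphens, trim the ends."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
```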
Write tests (plural) first, as in write more than one failing test before making it pass?