Comment by narush
3 days ago
Hey HN, study author here. I'm a long-time HN user -- and I'll be in the comments today to answer questions/comments when possible!
If you're short on time, I'd recommend just reading the linked blogpost or the announcement thread here [1], rather than the full paper.
Hey I just wanted to say this is one of the better studies I've seen - not clickbaity, very forthright about what is being claimed, and presented in such an easy-to-digest format. Thanks so much for doing this.
Thanks for the kind words!
I'll just say that the methodology of the paper and the professionalism with which you are answering us here is top notch. Great work.
Thank you!
It's good to know that Claude 3.7 isn't enough to build Skynet!
(I read the post but not paper.)
Did you measure subjective fatigue as one way to explain the misperception that AI was faster? As a developer-turned-manager I like AI because it's easier when my brain is tired.
We attempted to! We explore this more in the section Trading speed for ease (C.2.5) in the paper (https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf).
TLDR: mixed evidence, from both quantitative and qualitative reports, that AI makes the work less effortful for developers. The effect is unclear.
Was any attention paid to whether the tickets being implemented with AI assistance were an appropriate use case for AI?
If the instruction is just "implement this ticket with AI", then that's very realistic in that it's how management often tries to operate, but it's also likely to be quite suboptimal. There are ways to use AI that help a lot, and other ways that hurt more than it helps.
If your developers had sufficient experience with AI to tell the difference, then they might have compensated for that, but reading the paper I didn't see any indication of that.
The instructions given to developers were not just "implement with AI" - rather, they could use AI if they deemed it would be helpful, but indeed did _not need to use AI if they didn't think it would be helpful_. In about 16% of labeled screen recordings where developers were allowed to use AI, they chose to use no AI at all!
That being said, we can't rule out that the experiment drove them to use more AI than they would have outside of the experiment (in a way that made them less productive). You can see more in section "Experimentally driven overuse of AI (C.2.1)" [1]
[1] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf
Could you either release the dataset (raw but anonymized) for independent statistical evaluation, or at least add each dev's absolute time per task to the paper? I'm curious what each dev's absolute times with/without AI were, and whether the one guy with lots of Cursor experience was actually faster than the rest or just a slow typist getting a big boost out of LLMs.
Also, cool work. Very happy to see actually good evaluations instead of just vibes or observational studies that don't account for the Hawthorne effect.
Yep, sorry, meant to post this somewhere but forgot in final-paper-polishing-sprint yesterday!
We'll be releasing anonymized data and some basic analysis code to replicate core results within the next few weeks (probably next week, depending).
Our GitHub is here (http://github.com/METR/) -- or you can follow us (https://x.com/metr_evals) and we'll probably tweet about it.
Cool, thanks a lot. Btw, I have a very tiny podcast (audience of 50 to 100) where we try to give context to what we call the "muck" of AI discourse: grounding claims in what we would call objectively observable facts/evidence, and then _separately_ giving our own biased takes. If you'd be interested to come on it and chat, my contact email is in my profile.
Does this reproduce for early/mid-career engineers who aren't at the top of their game?
How these results transfer to other settings is an excellent question. Previous literature would suggest speedup -- but I'd be excited to run a very similar methodology in those settings. It's already challenging as models + tools have changed!