Comment by whstl

16 days ago

This makes a lot of sense and explains why some people are so captivated by modern models, while others see progress as merely incremental.

I'm sure that explains some of it, but I really don't think it explains most of the people who have been AI-pilled in the last nine months. There was no amount of context I could give GPT-4o that would make it a net benefit to use for agentic development. I tried it with quite sophisticated prompt systems and much simpler ones, with compendiums of code & business analysis and with sparser ones. Yet it just wasted my time - still, there were people using Cursor with that model and saying it was life changing. I didn't have that experience until Opus 4.5 - it's possible I could have had it earlier, but that was when I happened to try it again.

  • I think many of the people who have become "AI Pilled" (I'll include myself here) had it happen in the last 3 months. Even over the Christmas break, when the Wiggums loop got so much coverage - I still wasn't that blown away going into January/February - 50%+ of the time I'd just write the code myself. I like coding.

    But - I don't know if it was April or May - very recently the coding harnesses paired with decent SOTA models like Opus 4.8/GPT 5.5 started showing so much more consistency, completeness, and sometimes downright clever behavior that they became way more useful.

    Just one out of a hundred+ examples - I gave Claude Code (Opus 4.8 High) a complex task that involved Consul and Vault - but I had neglected to give it sandbox permission to download from hashicorp.com. So it created an entire test harness that simulated the behavior of both Vault and Consul, created all its test cases, verified that they passed - and when I came back 40 minutes later, said that it was all done.

    Its test harness so accurately simulated the behavior of Vault/Consul that on the first try - no refactoring whatsoever - all of the protobuf/AESGCM/API behavior (which has varied significantly between versions) worked.

    This was something that would have taken me, someone super super familiar with the code and tools and APIs, a minimum of 3 solid days of work - and that would likely have involved hundreds of attempts and refactors as I unwound all the weird encryption and packaging layers. It zero-shotted a full solution without having an API to test against.

    If these agents have an actual test harness - it's honestly hard to imagine what they can't do - subject only to imagination and budget at this point.

    Speaking personally - something changed between January and, let's say, May. Instead of seeing these things as mostly interesting technology demonstrations, in which the flaws outweighed the benefits, I now genuinely think they are the future of programming. I'm dubious that I'll write much software manually in the future - beyond what I do for personal pleasure.

    • Asked to write a macOS driver for a device that didn't have macOS support, GPT 5.5 found Linux firmware on the vendor's site, downloaded it, ran binwalk, extracted the driver, and got halfway to reimplementing it on macOS with barely any help from me. I did need to dive into it somewhat to get it across the line, but it showed some ingenuity along the way.

Which way do you think that goes? Are the ones who "get it" the ones who are captivated, or the ones who see progress as incremental?

  • I guess all of them?

    Some people "got" LLMs back in 2022, others needed it to evolve a bit.

    It's not unlike computers. I started using them back in the 90s and absolutely nobody I knew was interested, while today everyone carries one in their pockets...