Comment by budsniffer952

12 hours ago

>Putting that aside, I spend all day every day implementing very, very hard things right on the edge of what agents are (barely, sometimes) capable of

Is a single thing in your post demonstrable, or are we just supposed to take your word for it? Because all of this stuff sounds laughably subjective.

Most interesting things in software engineering are (laughably) subjective.

Just check out any conversation on dynamic vs static typing, talk to a Rust zealot, or ask a backend engineer if microservices were a mistake.

It's unfortunate, and it makes it hard to have proper discussions on these subjects. It would be worthwhile to figure out how we can have more constructive arguments.

  • "Have you ever noticed that anybody driving slower than you is an idiot, and anyone going faster than you is a maniac?" -- George Carlin

  • Thanks very much for saying this!

    Frankly, it feels like we should just sidestep arguments entirely and just all contribute our messy data/reports, and then see how we can meld all of it together, to find the best answers for our individual situations.

    Probably a good use of frontier AI, melding all of that!

It's all closed code, so I don't have a great way of showing you, but this is all pretty easy to test for yourself, and a good chunk of it is fairly objective:

On performance: just grab CC + Codex and try Opus 4.8 xhigh and GPT 5.5 xhigh side by side. Ask them a trivial question about something that's already in their context. Opus will churn for 30 seconds, and GPT 5.5 will respond in about three seconds. If you try the same with Fable 5 you'll notice way better adaptive thinking than Opus (it'll be quicker than Opus, even on xhigh – although often still slower than 5.5).

I have many, many times done 'Opus xhigh, Opus max and GPT xhigh all tried to implement something' – Opus max is... hours and hours. Opus xhigh is usually ~1.5-2x GPT 5.5 xhigh. This feels like a pretty straightforward generalization of the first point. Again, just try racing three agents and see what you get.
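For anyone who wants to make the "race the agents" comparison a bit less anecdotal, here is a minimal sketch of a timing harness. It only measures wall-clock time for whatever commands you hand it. The names `time_command`, `race`, and `demo_commands` are made up for illustration, and the demo commands are stand-ins that just sleep: this deliberately does not assume any particular agent CLI or its flags, so you would substitute the real invocations yourself.

```python
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor


def time_command(name, argv, stdin_text=""):
    """Run one command to completion and return (name, seconds, exit_code)."""
    start = time.perf_counter()
    result = subprocess.run(argv, input=stdin_text, capture_output=True, text=True)
    return name, time.perf_counter() - start, result.returncode


def race(commands, stdin_text=""):
    """Run all commands concurrently and return results sorted fastest first."""
    with ThreadPoolExecutor(max_workers=len(commands)) as pool:
        futures = [
            pool.submit(time_command, name, argv, stdin_text)
            for name, argv in commands.items()
        ]
        results = [future.result() for future in futures]
    return sorted(results, key=lambda r: r[1])


# Hypothetical stand-ins: replace each argv with the real agent invocation.
demo_commands = {
    "agent-a": [sys.executable, "-c", "import time; time.sleep(0.1)"],
    "agent-b": [sys.executable, "-c", "import time; time.sleep(1.0)"],
}

if __name__ == "__main__":
    for name, seconds, code in race(demo_commands, "same prompt for everyone"):
        print(f"{name}: {seconds:.2f}s (exit code {code})")
```

Wall-clock time says nothing about whether the output was any good, so this only covers the speed half of the comparison; judging the implementations is still manual.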

As far as 'right on the edge of what they're able to do', my specific tasks don't matter. Just find something that no matter how hard you try, with however many agents or combinations thereof, with arbitrarily detailed plans, agents can't seem to implement without massive mistakes or a hollowing-out of 'the point' of the implementation... and then try it on the 'following generation' of models. I've been doing this repeatedly with coding agents since I turned aider into a CC-like coding agent in early 2025 (this was my second one; my first modern-style coding agent was in Jan 2025): https://github.com/Aider-AI/aider/pull/3781

A couple of examples of the latter thing that I tend to work on are database internals (indexes, query planner stuff, etc.; I built the DB in full before agents, they just work on it with me), very advanced UIs (try making a beautiful Rolex-like interactive visualization of the internals of a mechanical watch with Opus and see how far it gets – not very), and 'hardcore product questions' (all agents kinda suck at schema – Fable far less than prior ones). I have dozens and dozens of these that they can't do, though.

  • I can anecdotally back up that Opus takes a ridiculously long time to respond to basic questions. We’re talking, “you implemented this scoped feature on a web app, could you change the buttons to have a loading state like $EXAMPLE?” And it’ll be Discombobulating for 20+ seconds.

    I don’t remember this always being true.