Comment by jasondigitized

8 days ago

A single 8h task? I'm sorry, but that's just asking for trouble.

I don't understand how some of y'all use these things. I get garbage unless I give them very specific concrete tasks with as much context as possible. Anything that takes more than 30 min is usually a waste because the scope was too large.

  • Different people just have different concepts of what's garbage and what's not.

    There seems to be some kind of AI hysteria going on, with people becoming so enamoured with the AI that they accept anything it produces as if it's some gift from the gods, while others just reject it prima facie.

    For example, the worst design I have seen recently was from a designer who pivoted into "vibe coding influencer". The worst code is from developers who were heavily into Clean Code a couple years ago and now half their PRs are unused dead code.

    • “One man’s trash is another man’s treasure.” takes on a new meaning in today’s agentic coding world.

  • I had good experiences doing multi-hour refactoring/housekeeping tasks that basically consisted of applying the same steps and rules n times.

    Worth noting, a significant chunk of those runs involved the agent waiting for the compiler, linters, type checks, and test suites, as well as updating journals. It’s not the agent sputtering out code for eight hours straight.

    And naturally I spend more time on manual verification in the end, as much less of it happens during the coding process.

    • > that basically consisted of applying the same steps and rules n times.

      Why use a non-deterministic, possibly hallucinatory, definitely expensive, LLM when it sounds like a codemod is the perfect solution for this?

    • > ... applying the same steps and rules n times

      I do this too, with a document written for this purpose.

      > ... a significant chunk of those runs involved the agent waiting for the compiler, linters, type checks, and test suites, as well as updating journals.

      That is a good point. I'm mostly using C, which seemingly compiles in O(1) time, so I could imagine a large C++ or Rust codebase taking much longer to iterate simply due to compilation times.

  • Clear winner's circle. Clear objective. Clear scope.

    Clear evaluation function for an objective metric of whether they are making progress or regressing.

    Evaluation function is computed, not llmed.

    Ontology of potential actions clearly specified.

    Accurate inventory of the current status quo.

    Clear enumeration of options from status quo towards the winner's circle.

    Waypoint objectives with similarly concrete evaluations of pass/fail, or on target/off target.

    It's the same thing when leading a large organization to actually hit a goal. There's randomness every turn away from your mind, so the more constrained the options, the more likely you are to hit the target. The consequence is if you're wrong about the plan then with people you're fucked. Morale will plummet. With AIs, they are so nerfed emotionally now, you clear context and start again.

    I did enjoy Sonnet 4 when they would swear randomly and become sullen or wax desperately. That would at least cause pushback against a bad plan.
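
A minimal sketch of what "evaluation function is computed, not llmed" with pass/fail waypoints could look like. This is only an illustration under assumptions: the metric names (`failing_tests`, `lint_errors`) and the `Waypoint` type are hypothetical, and the metrics are presumed to come from real tool output rather than from the model's own report.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Mapping, Sequence


@dataclass(frozen=True)
class Waypoint:
    """A named pass/fail check over measured metrics (hypothetical type)."""
    name: str
    check: Callable[[Mapping[str, float]], bool]


def evaluate(metrics: Mapping[str, float], waypoints: Sequence[Waypoint]) -> Dict[str, bool]:
    """Compute pass/fail for every waypoint from measured numbers only."""
    return {w.name: w.check(metrics) for w in waypoints}


def progress(before: Mapping[str, float], after: Mapping[str, float],
             waypoints: Sequence[Waypoint]) -> str:
    """Compare how many waypoints pass before and after a change."""
    delta = sum(evaluate(after, waypoints).values()) - sum(evaluate(before, waypoints).values())
    if delta > 0:
        return "progressing"
    if delta < 0:
        return "regressing"
    return "flat"


# Hypothetical waypoints; the metric names are assumptions for illustration.
WAYPOINTS = [
    Waypoint("tests_green", lambda m: m["failing_tests"] == 0),
    Waypoint("lint_clean", lambda m: m["lint_errors"] == 0),
]

if __name__ == "__main__":
    before = {"failing_tests": 4, "lint_errors": 0}
    after = {"failing_tests": 0, "lint_errors": 0}
    print(evaluate(after, WAYPOINTS))
    print(progress(before, after, WAYPOINTS))
```

The point of the design is that the agent never grades itself: every verdict is a plain function of numbers produced by the compiler, linter, or test runner.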

  • Fable promised to be better at long-running tasks.

    The parent post has a goal of "..see how it will perform.."

    There is nothing wrong with experimenting with something new.

  • This is my fucking life at work right now. I look forward to the weekends. I've never been truly inconvenienced by shitty devs because they're often too lazy to really spam me with bad code, but now they are all free to do so. I spent so much time today writing guardrail markdown files when these people SHOULD HAVE BEEN ABLE TO REVIEW THE OUTPUT AND KNOW THAT IT WAS BAD.

    It truly is the age of the 90 IQ software engineer. They've never had it better.

    • As if meetings weren't bad enough already, I now have to sit through an informal introduction to the model of the week and its personality characteristics and how quickly it burnt through one subscription's token allotment or whatever and the latest tweaks on the magic markdown files. Luckily I've only had a couple changes sent my way so far, which weren't much different than just getting a bug report to debug and fix myself. I will need to get into risky options gambling or something so I can go start my farm early, if it keeps going this way. Even supposing it all works correctly, I don't see how it is in any way enjoyable, satisfying, or fulfilling.

  • You have to build up a context, or otherwise seed the memory, to get anything useful out of these LLMs on a large or existing project.

  • If you're giving it 8 hours of stuff to create with a template (e.g. slop forking) that's not a big deal. Letting it run for 8 hours to debug a weird failure also tends to work out.

Indeed, according to METR, Mythos only achieved an 80% success rate with 3 hour tasks. https://metr.org/time-horizons/

  • I use both Opus and Fable on tasks that are well beyond "things that would take a human 3 hours"

    It fails all the time - as in it ends up doing something I want to change.

    But this doesn't actually matter - if it takes 3 or 4 iterations on something that would have taken me a week it might be a day of human work, but it's still 5 times better than doing it by hand.

    • This seems like the obvious correct frame of mind with which to approach these tools. If it works for three hours on a task that would have taken me three work weeks, and 20% of the time it gets the task wrong, then I can just ask it to do it again with adjusted instructions. It will be much more likely to get it right the second time, and I’m still ahead of where I would have been by 14 days and 2 hours.
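
The "14 days and 2 hours" figure above can be checked with a few lines. This is a rough sketch that assumes an 8-hour workday and a 5-day work week, neither of which the comment states. The expected-attempts helper further assumes every attempt succeeds independently at the same rate, which is pessimistic relative to the comment's claim that a retry with adjusted instructions does better.

```python
HOURS_PER_DAY = 8   # assumption: 8-hour workday
DAYS_PER_WEEK = 5   # assumption: 5-day work week


def hours_saved(manual_weeks: float, run_hours: float, attempts: int) -> float:
    """Hours saved versus doing the task by hand, after a number of agent runs."""
    manual_hours = manual_weeks * DAYS_PER_WEEK * HOURS_PER_DAY
    return manual_hours - run_hours * attempts


def expected_attempts(success_rate: float) -> float:
    """Mean number of runs until the first success, if runs are independent."""
    return 1.0 / success_rate


if __name__ == "__main__":
    # Three work weeks by hand versus two 3-hour agent runs.
    saved = hours_saved(manual_weeks=3, run_hours=3, attempts=2)
    days, hours = divmod(saved, HOURS_PER_DAY)
    print(f"{int(days)} days and {int(hours)} hours saved")  # 14 days and 2 hours saved
    print(f"expected attempts at 80% success: {expected_attempts(0.8):.2f}")  # 1.25
```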

This sounds like classic "you're using it wrong". If they had said it was done in smaller tasks, you would very likely have people here saying that was wrong too.

My record for a single uninterrupted session (albeit with Codex, not Claude) is 80+ hours. It was very productive, too.

The trick is having large, extensive test suites and forcing the agent to run them regularly.

  • So I guess that a lot of those 80 hours were spent running the test suite between changes?

    • Yep. I should add that the current crop of models is much more tolerant of something like this, compared to where we were a year ago - as in, they are quite willing to wait for a long time for the test or profiling run to finish without giving up on it, if the instructions make it clear that this is normal and expected.

  • An agent can’t have an “uninterrupted session” if you have to be “forcing” it to do stuff.

    • "Forcing" here basically means giving initial instructions that clearly require passing the tests as a condition of finishing the work. The agent still works uninterrupted.

If there are some specific tests/evals to satisfy that an agent can run by itself, it can easily iterate for hours. And this time also includes running those tests/evals, which may not be small.