
Comment by BiteCode_dev

5 hours ago

You got me good with this one.

But seriously, this is my main answer to people telling me AI is not reliable: "guess what, most humans are not either, but at least I can tell AI to correct course and its ego won't get in the way of fixing the problem".

In fact, while AI is not nearly as good as a senior dev for non-trivial tasks yet, it is definitely more reliable than most junior devs at following instructions.

Its ego won't get in the way, but its lack of intelligence will.

A junior might be reluctant at first, but if they are smart they will learn and get better.

So maybe LLMs are better than not-so-smart people, but you usually try to avoid hiring those people in the first place.

  • That's exactly the thing. Claude Code with Opus 4.5 is already significantly better at essentially everything than a large percentage of devs I had the displeasure of working with, including learning when asked to retain a memory. It's still very far from the best devs, but this is the worse it'll ever be, and it already significantly raised the bar for hiring.

    • > but this is the worse it'll ever be

      And even if the models themselves for some reason were to never get better than what we have now, we've only scratched the surface of harnesses to make them better.

      We know a lot about how to make groups of people achieve things individual members never could, and most of the same techniques work for LLMs, but it takes extra work to figure out how to most efficiently work around limitations such as the lack of integrated long-term memory.

      A lot of that work is in its infancy. E.g. I have a project I'm working on now where I'm up to a couple dozen agents, and every day I'm learning more about how to structure them to squeeze the most out of the models.

      One learning that feels relevant to the linked article: instead of giving an agent the whole task across a large dataset that would overwhelm its context, it often helps to have a separate agent - which can use Haiku, because it's fine if it's dumb - comb the data for <information relevant to the specific task> and generate a list, and have the bigger model use that list as a guide.
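
      As a rough illustration of that pattern, here is a minimal sketch; the model calls (`call_small_model`, `call_big_model`) are hypothetical stand-ins for, say, a Haiku call and an Opus call:

```python
# Sketch of the two-stage pattern above: a cheap "scout" model combs a large
# dataset for task-relevant items, and only that shortlist reaches the
# expensive model's context. The model calls are stubbed placeholders.

def chunk(records, size):
    """Split the dataset into pieces small enough for the scout's context."""
    return [records[i:i + size] for i in range(0, len(records), size)]

def call_small_model(batch, task):
    # Placeholder for a small-model call such as:
    # "return only the records relevant to <task>".
    return [r for r in batch if task in r]

def call_big_model(shortlist, task):
    # Placeholder for the larger model, which now sees a short guide
    # instead of the raw dataset.
    return f"completed {task!r} using {len(shortlist)} relevant records"

def run(records, task, chunk_size=100):
    shortlist = []
    for batch in chunk(records, chunk_size):
        shortlist.extend(call_small_model(batch, task))
    return call_big_model(shortlist, task)
```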

      So the progress we're seeing is not just raw model improvements, but also work like that in this article: figuring out how to squeeze the best results out of any given model. That work would continue to yield improvements for years even if models somehow stopped improving.

Key differences, though:

Humans are reliably unreliable. Some are lazy, some sloppy, some obtuse, some all at once. As a tech lead you can learn their strengths and weaknesses. LLMs vacillate wildly while maintaining sycophancy and arrogance.

Human egos make them unlikely to admit error, sometimes, but that fragile ego also gives them shame and a vision of glory. An egotistical programmer won’t deliver flat garbage for fear of being exposed as inferior, and can be cajoled towards reasonable output with reward structures and clear political rails. LLMs fail hilariously and shamelessly in indiscriminate fashion. They don’t care, and will happily argue both sides of anything.

Also that thing that LLMs don’t actually learn. You can threaten to chop their fingers off if they do something again… they don’t have fingers, they don’t recall, and can’t actually tell if they did the thing. “I’m not lying, oops I am, no I’m not, oops I am… lemme delete the home directory and see if that helps…”

If we’re going to make an analogy to a human, LLMs reliably act like absolute psychopaths with constant dissociation. They lie, lie about lying, and lie about following instructions.

I agree LLMs are better than your average junior the first time they follow a directive. I’m far less convinced about that comparison over time, as ongoing dialog turns juniors into more effective engineers.

  • You can absolutely learn an LLM's strengths and weaknesses too.

    E.g. Claude gets "bored" easily (it will even tell you this if you give it overly repetitive tasks). The solution is simple: since we control the context and it has no memory outside of that, make it seem like it's not doing repetitive tasks by having the top agent "only" manage and subdivide the task, and farm out each sub-task to a sub-agent that won't get bored because it only sees a small part of the problem.
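
    The sub-agent trick above can be sketched as follows; `run_subagent` is a hypothetical stand-in for spawning a fresh model session:

```python
# Sketch of the sub-agent pattern: the orchestrator only plans and
# subdivides, and each batch goes to a fresh "agent" with an empty context,
# so no single agent ever sees the full repetitive workload.

def subdivide(task_items, batch_size=5):
    """The orchestrator's only job: split the work into small batches."""
    return [task_items[i:i + batch_size]
            for i in range(0, len(task_items), batch_size)]

def run_subagent(batch):
    # Stand-in for a fresh model session: it receives only its own batch,
    # with no memory of the batches that came before.
    context = list(batch)        # context starts empty except for the batch
    return [f"done: {item}" for item in context]

def orchestrate(task_items):
    results = []
    for batch in subdivide(task_items):
        results.extend(run_subagent(batch))   # fresh sub-agent per batch
    return results
```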

    > Also that thing that LLMs don’t actually learn. You can threaten to chop their fingers off if they do something again… they don’t have fingers, they don’t recall, and can’t actually tell if they did the thing. “I’m not lying, oops I am, no I’m not, oops I am… lemme delete the home directory and see if that helps…”

    No, but like characters in a "Groundhog Day" scenario, they don't remember, so you can change their behaviour while you figure out how to get them to do what you want - test, adjust, and find what works - and while it's not perfectly deterministic, you get close.

    And unlike with humans, sometimes the "not learning" helps us address other parts of the problem. E.g. if they learned, the "sub-agent trick" above wouldn't work, because they'd realise they were carrying out a bunch of tedious tasks instead of remaining oblivious, since we let them forget in between each one.

    LLMs in their current form need harnesses, and we can learn - and are learning - which types of harnesses work well. Incidentally, a lot of them work on humans too (despite our pesky memory making it harder to slip things past us), and many are methods we know from the very long history of figuring out how to make messy, unreliable humans adhere to processes.

    E.g. to go back to my top example of getting adherence to a boring, repetitive task: create checklists, subdivide the task with individual reporting gates, spread it across a team if you can, and put in place a review process (with a checklist). All of these are techniques that work on both human teams and LLMs to improve process adherence.
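
    Those techniques can be combined into a tiny harness sketch; the checklist items, `run_step`, and `review` are all hypothetical placeholders for agent calls:

```python
# Sketch of a checklist-with-gates harness: each checklist item is run in
# order, a report entry is recorded at each gate, and a final review pass
# checks that every item actually reported in.

CHECKLIST = ["extract fields", "validate schema", "write output"]

def run_step(step, payload):
    # Placeholder for an agent performing one checklist item on the payload.
    return f"{step}: ok"

def review(report):
    # Placeholder review gate: every checklist item must appear in the report.
    return all(any(entry.startswith(step) for entry in report)
               for step in CHECKLIST)

def run_with_gates(payload):
    report = []
    for step in CHECKLIST:
        report.append(run_step(step, payload))   # reporting gate per step
    assert review(report), "review gate failed"
    return report
```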