
Comment by oooyay

5 days ago

What's really interesting in this comment chain is something I've been pointing out a lot more lately: when someone knows an LLM was involved, they raise their expectations. I do it too in my own work, and I have to remind myself of things like "this bug would've also likely occurred with a human working at this level of complexity." The real question is did the operator arbitrarily and knowingly increase the level of complexity or is it appropriate for the task.

> The real question is did the operator arbitrarily and knowingly increase the level of complexity or is it appropriate for the task.

There's one major reason to have higher expectations for autonomous systems (of all kinds, not just LLM-powered) than for humans, at least for systems intended to be deployed at scale, and that's the scale itself. If a human makes a mistake, has biases, or even intentionally breaks the rules, the impact of their actions is limited by the fact that they're one person. Something like an autonomous driving system or a coding agent is intended to be deployed by the thousands, millions, or more, and any problematic behavior happens at that scale.

There are obviously millions of bad drivers out there, but every one of the human ones is bad in different ways. If Waymo pushes a bad update there could be tens of thousands of "drivers" that suddenly become bad in identical ways.

Humans also have the ability to learn from our mistakes. The ones you'd want to have working for you usually don't make the same one twice. LLMs are pretty good at making the same mistake repeatedly, even the simplest things like basic math or counting letters.

And there’s good reason for that. Anthropic, OpenAI, Salesforce, and so on have aggressively marketed LLMs as better than humans at working. It’s no surprise that when we find out something is built using an LLM, we expect it to match the marketing.

  • But what constitutes "better than humans at working"?

    Zero defects? Because you can always find at least one defect. But people don't naturally think statistically, so they grasp the thing that confirms their bias and then hang on tenaciously.

    You'll notice the incredible amount of vitriol resulting from a purely cosmetic bug (which, it turns out, results from a missing TERM env in the base image - Claude is very conservative when it can't determine utf-8 support with 100% certainty).
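
To make the "conservative when it can't determine utf-8 support" point concrete, here is a minimal sketch of what a conservative check could look like. This is not Claude's actual logic, which isn't shown anywhere in this thread; the function name `can_assume_utf8` and the exact rules are assumptions for illustration only. The idea is just that a missing TERM is enough to fall back to plain ASCII output, even when the locale looks fine.

```python
def can_assume_utf8(env):
    """Hypothetical conservative check: say yes only when every signal agrees.

    `env` is a mapping such as os.environ. The rules here are illustrative
    assumptions, not the behavior of any real tool.
    """
    term = env.get("TERM", "")
    if not term or term == "dumb":
        # No usable terminal type (e.g. TERM missing from a base image):
        # refuse to assume UTF-8 and fall back to ASCII.
        return False

    # POSIX locale precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    locale_value = env.get("LC_ALL") or env.get("LC_CTYPE") or env.get("LANG") or ""

    # Accept common spellings such as "en_US.UTF-8" and "C.utf8".
    normalized = locale_value.lower().replace("-", "")
    return "utf8" in normalized


def pick_bullet(env):
    """Choose a fancy glyph only when UTF-8 support looks certain."""
    return "\u2022" if can_assume_utf8(env) else "*"
```

With a check like this, a container that sets `LANG=en_US.UTF-8` but never sets TERM would still get the ASCII fallback, which is the kind of purely cosmetic difference described above.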