Comment by deanc

4 hours ago

I've worked on projects in the airline and health industry which are highly regulated too. The regulations can be incredibly difficult to process and implement, and it is hard to make sure you adhere to everything correctly. I've been involved in multiple scenarios where people have made false assertions about compliance or the lack of it. I'd still place a bet that the SOA models make _far_ less mistakes than humans.

They might make fewer mistakes, but the mistakes aren't evenly distributed. They don't use logic when making mistakes; the mistakes come from gaps in the training data and how large of a span they have to bridge in the latent space. Just as they aren't smart like humans, they aren't stupid like humans. Don't mistake rate for quality.

  • Yeah, this starts to overlap with some autonomous vehicle stuff, where I like to say that the rate of errors is not the shape or distribution of errors.

    We have long historical experience and innate tools for detecting and mitigating errors made by humans. If we can't apply those to automation, then even fewer total mistakes may end up being a worse outcome.
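
The "rate is not the distribution" point above can be made concrete with a little arithmetic. This is only a sketch: every number below is made up for illustration, and `ErrorClass`, `total_rate`, and `expected_escaped_cost` are hypothetical names, not anything from a real library or from the posters above. The idea is to weight each kind of mistake by how often existing checks miss it and by how much damage it does when it slips through.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorClass:
    """One kind of mistake a reviewer can make (all values are hypothetical)."""
    name: str
    rate: float        # probability of this mistake per reviewed item
    catch_prob: float  # chance that existing checks (tests, review) catch it
    cost: float        # damage if it slips through, in arbitrary units


def total_rate(errors):
    """Raw mistake rate, ignoring what kind of mistakes they are."""
    return sum(e.rate for e in errors)


def expected_escaped_cost(errors):
    """Expected damage per item from mistakes that are NOT caught."""
    return sum(e.rate * (1.0 - e.catch_prob) * e.cost for e in errors)


# Hypothetical profile: humans err more often, but in familiar ways
# that existing processes were built to catch.
human = [
    ErrorClass("misread clause", rate=0.05, catch_prob=0.9, cost=10),
    ErrorClass("misjudged requirement", rate=0.01, catch_prob=0.8, cost=100),
]

# Hypothetical profile: the model errs less often, but one of its error
# classes is unfamiliar and rarely caught by the same checks.
model = [
    ErrorClass("misread clause", rate=0.01, catch_prob=0.9, cost=10),
    ErrorClass("confident fabrication", rate=0.01, catch_prob=0.2, cost=100),
]

for label, profile in (("human", human), ("model", model)):
    print(label, round(total_rate(profile), 4),
          round(expected_escaped_cost(profile), 4))
```

With these invented numbers the model makes a third as many mistakes per item (0.02 vs 0.06) yet has roughly three times the expected escaped cost (about 0.81 vs 0.25). Change the `catch_prob` assumptions and the conclusion flips, which is the whole argument: the outcome depends on the shape of the errors, not just the count.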

For some reason, tons of people seem to be in camps at both extremes. It's either "AI sucks don't trust it!" or "AI is so much better than humans!"

But the most reasonable take, which I'm happy to see reflected in so many comments in this thread, is… use both.

Do an AI pass, and have humans verify, and vice versa. Let the humans drive the AI. Then the unique shortcomings of each party can be covered by the other's strengths.

  • AI review is never going to beat a fully resourced human review.

    It might beat an underresourced human review on time, efficiency, and cost metrics. But on the metric of accuracy, throwing unlimited humans at a problem will still beat throwing unlimited AI at it.

    • That's an irrelevant comparison because cost is always a constraint, so there are not going to be unlimited AI or humans. The question is how to optimally combine them for a given cost.

  • > Do an AI pass, and have humans verify, and vice versa. Let the humans drive the AI.

    You can do that, sure. But doing so negates any improvements in speed the LLM brought. And at that point, you may as well just do it yourself to begin with.

    • When Google showed up on the scene I found I no longer needed to memorize basic syntax and other such things. If I couldn't remember on the fly, I'd just do a quick Google search and move on. This freed space in my mind to instead focus on bigger & better things.

      I use GenAI tools when coding a lot, but I do not vibe code. I go through everything it generated, and we iterate. And yes, it doesn't save me a lot of time. But what it does do is free up mental capacity in a similar manner. But instead of syntax, it's more complicated patterns. Maybe I don't remember how to stitch something together, but I know it can be done. Instead of spending the time to look it up and then code it, I just tell it to do it for me.

    • Yeah, humans reviewing the AI review can only detect the false positives, where the LLM claims something is non-compliant and flags it for review/correction by a human or another agent. Human review can’t find the false negatives (true deficiencies not flagged) unless you do a full audit yourself to find whatever deficiencies the AI missed.

    • I feel like you're missing the point that it's more thorough to use both. Speed isn't the only factor that matters.

  • This makes sense, but a logical next step is to have one AI write code, and then have another AI, instead of humans, verify it.

    Or are current AIs too similar for that to be fruitful?

    • This is commonly known as "LLM-as-a-judge" and anecdotally multiple people I know who write code using OpenRouter or using multiple models say it's surprisingly effective. It's strange that there don't appear to be any major papers on it since ~early 2025, which at this point is basically ancient history.
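
On the earlier point in this sub-thread that a human reviewing only the AI's flags can catch false positives but not false negatives, here is a toy sketch. The item IDs, the ground truth, and the helper names (`human_check`, `review_flags_only`) are all hypothetical, and the human is assumed to be perfectly accurate on whatever they actually look at, which is generous.

```python
# Hypothetical ground truth: 10 items, three of which are truly deficient.
items = set(range(10))
truly_deficient = {1, 4, 7}

# Hypothetical AI output: item 2 is a false positive, item 7 a false negative.
ai_flags = {1, 2, 4}


def human_check(item):
    """Assumed-perfect human judgement on any item the human inspects."""
    return item in truly_deficient


def review_flags_only(flags):
    """Human inspects only the flagged items and keeps the confirmed ones."""
    return {item for item in flags if human_check(item)}


confirmed = review_flags_only(ai_flags)
false_positives_removed = ai_flags - confirmed   # caught by the human
still_missed = truly_deficient - confirmed       # never seen by the human

# A full audit inspects every item, flagged or not.
full_audit = {item for item in items if human_check(item)}

print("confirmed:", sorted(confirmed))
print("false positives removed:", sorted(false_positives_removed))
print("still missed:", sorted(still_missed))
print("full audit finds:", sorted(full_audit))
```

Reviewing the flags drops item 2 and confirms items 1 and 4, but item 7 is never inspected, so it stays missed. Only the full audit, which costs as much as doing the review without the AI, recovers it.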

Not according to my experience.

Regulation questions, even the simple ones, AI gets wrong all the time. It wasn't Mythos, but other models like Opus.

I can adjust my view on this topic if/when we get access to Mythos.

>I'd still place a bet that the SOA models make _far_ less mistakes than humans.

Genuine question: your top coder seems to be producing the most error-free code from your perspective, has the deepest knowledge of the architecture and codebase, and is faster on the trigger than the others.

But your top coder has proven and verifiable dementia, where they will confidently assume the existence of apis and code that do not exist, mix up the purpose of others and forget other things, and you can't predict when and how they will introduce errors into the system or the severity of such errors.

Are you really comfortable letting this person with dementia generate most of your codebase in the airline and health industry?

I also hope you have an iron-clad agreement that prevents the model provider from doing silent updates, because all the evidence of correctness you have collected thus far goes out the window in that case.

Another genuine question:

You have witnessed a human coder and the AI you're using make the same important mistake. Assuming you do not have the time and resources to retrain, fine-tune, and test your frontier model:

Who would you trust not to make the same mistake multiple times in the future after you have warned them that their job depends on it, the AI or the human?

  • Your top coder has guard rails in place to prevent him autonomously going free - right? This is how you should approach agentic development with LLMs. Like it or not, we are the final bastion, the gatekeepers. The hallucination thing I think is mostly overblown and from speaking to colleagues it seems to vary wildly depending on which model and harness you are using - always go for SOA. In the last 3 months I can count on one hand the times it's done something wrong, and that's primarily because I'm operating it with guard rails and giving it context.

    • >Your top coder has guard rails in place to prevent him autonomously going free - right?

      The parent is implying they would prefer an AI when working in the airline and health industry because it makes fewer errors. Read the comment again.

      They have not said, "Hey, I work in the airline and health industry and I'd love to use AI for a couple of the bullshit IT UIs we have as long as we can put guardrails on the AI to stay in its lane."

      I asked a yes or no question. The guardrails you can put in place to mitigate errors are the same guardrails used pre-AI for humans (tests, regressions, reviews). If you were wary of employing a top lead engineer with verifiable dementia prior to AI for a mission critical system, logic implies you should think twice about giving that much responsibility to an AI as well.

      > The hallucination thing I think is mostly overblown

      Can you predict when and how the SOTA model will hallucinate? Yes or no. Can you predict the severity impact of that error beforehand? Yes or no.

      >from speaking to colleagues it seems to vary wildly depending on which model and harness you are using

      You have partially answered my question it would seem.

> I'd still place a bet that the SOA models make _far_ less mistakes than humans.

Well too bad, the problem is that they also produce things much faster than humans so errors will compound quicker.

This stupid argument again. The number of mistakes _does not matter_. Get. This. In. Your. Head. The predictability of the _type_ of error is what matters. For LLMs and machine learning in general the error distribution is not what you would expect and it is not possible to predict either.