Comment by ceejayoz
15 hours ago
> 60% of evaluated AI Scribe systems mixed up prescribed drugs in patient notes, auditors say
Not mentioned, as far as I can see: the comparative human mistake rate.
Having seen a lot of medical records, 60% sounds about normal lol.
Even if you had the same 60% error rate with humans, the types of errors would be vastly different. Humans might make typos, forget to include something, or even occasionally misremember some minor detail, but that's very different from an AI hallucinating things out of nowhere. AI makes the kinds of mistakes no human ever would, which means they're either glaringly obvious and easy to catch, or they're something no human would even think to question or watch out for, because it makes no sense why an AI would randomly (and confidently) say something so wrong.
Also, a machine needs to be better than a human to be accepted. I value humans intrinsically. I do not do the same for machines, I only care about the results they produce. If you give me a machine and a human that are both equally unreliable, I'll pick the human because he is a living creature worthy of my respect.
60% is insanely high and absolutely not in line with any human mistake rate. What charts are you reading?
This just says 60% of systems, not how frequent the errors were within those systems. They evaluated 20 systems, so 12 systems had at least one prescription mistake in their notes, but there's no information about how common those mistakes were, which makes it hard to judge relative to human performance.
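A toy calculation makes the distinction concrete. These per-note counts are entirely made up (the audit doesn't publish them); the point is only that '60% of systems had an error' is compatible with a wide range of per-note error rates:

    # Hypothetical: errors found in 100 sampled notes for each of 20 systems.
    errors_per_100_notes = [0, 0, 0, 0, 0, 0, 0, 0,  # 8 systems with no errors found
                            1, 1, 1, 2, 2, 3, 3, 4,  # mostly rare errors
                            8, 10, 15, 40]           # a few bad outliers

    systems_with_errors = sum(1 for e in errors_per_100_notes if e > 0)
    print(systems_with_errors / len(errors_per_100_notes))  # 0.6 -> "60% of systems"

    total_errors = sum(errors_per_100_notes)
    total_notes = 100 * len(errors_per_100_notes)
    print(total_errors / total_notes)                       # 0.045 -> 4.5% of notes

Both headlines describe the same data, and the per-note number could just as easily be 0.5% or 20% without moving the '60% of systems' figure at all.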
Outlandish claim; you'd better show some evidence. I've reviewed several medical charts too, and the error rate is much lower than that. Typically everything is dictated and transcribed, both of which are fairly mature and accurate technologies.
I was curious, so I looked it up. The human medication administration error rate is about 20%, but only about 8% excluding timing errors.
> Medication errors were common (nearly 1 of every 5 doses in the typical hospital and skilled nursing facility). The percentage of errors rated potentially harmful was 7%, or more than 40 per day in a typical 300-patient facility. The problem of defective medication administration systems, although varied, is widespread.
https://jamanetwork.com/journals/jamainternalmedicine/fullar...
> In all, 91 unique studies were included. The median error rate (interquartile range) was 19.6% (8.6-28.3%) of total opportunities for error including wrong-time errors and 8.0% (5.1-10.9%) without timing errors, when each dose could be considered only correct or incorrect
https://pubmed.ncbi.nlm.nih.gov/23386063/
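As a rough sanity check on those numbers (my assumptions, not the paper's: a 300-patient facility, about 10 medication doses per patient per day):

    # Back-of-envelope check on the JAMA figures; dose volume is assumed.
    patients = 300
    doses_per_patient_per_day = 10   # assumption for illustration
    error_rate = 0.196               # ~1 in 5 doses, from the meta-analysis
    harmful_fraction = 0.07          # 7% of errors rated potentially harmful

    errors_per_day = patients * doses_per_patient_per_day * error_rate
    harmful_per_day = errors_per_day * harmful_fraction
    print(round(errors_per_day), round(harmful_per_day))  # 588 errors, 41 harmful

which lands right around the 'more than 40 per day' the authors report.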
But who is responsible is different.
(And if you already see 60% error rates in standard, pre-AI note taking, how does that not translate into many deaths and injuries? At least one country's health system in the world should have caught that.)
> And if you already see 60% error rates in standard, pre-AI note taking, how does that not translate into many deaths and injuries?
Presumably most doctors' visits are a one-problem-one-solution-one-doctor type of thing. Done deal, notes are never read again. That alone would explain why high error rates don't result in injuries or deaths very often.
Any injury or death caused by poor notes would have to come from mistakes made while you're being followed for a serious chronic condition, or while you're being handled by a team where effective communication is required.
> how does that not translate into many deaths and injuries?
Because most of it is just written down and never looked at again until there’s a lawsuit or something.
The human who hits Submit or Approve is responsible.
The management human who offered the bad tool to the other human is responsible.
The robot cannot be responsible in place of us.
Yeah, the problem is the health system has no scapegoat if the AI note taker records the wrong detail. The last thing we want is the CTO being responsible!
I'm not convinced the CTO would be held accountable either.
I do wonder if people would be pushing AI so hard if their organizations were planning to hold them accountable for the mistakes the AI made.
I bet if that were the case, we'd see a much slower rollout of AI systems.
This is not a popular view, 'AI sucks at X, but so do humans', but I think it is valid, and we should take wins where we can, especially in healthcare. It is pretty clear that initial accuracy issues will become less and less of a problem as these technologies mature.

The focus on accuracy now, as a 'see, it's bad' talking point, misses the real danger: medical note takers have an exceptionally high chance of being hijacked for money, and that is an issue we need to bring attention to now. They provide a real-time feed into a trillion-dollar industry. Just roll that around in your head for a second. Insurance companies are going to want to tap that feed in real time so they can squeeze out more money. Drug makers are going to want to tap into it so they can abuse the data. Hospitals will want to tap into it to wring more out of doctors and boost the number of billable codes for each encounter. Very few entities are looking to tap into that feed to, you guessed it, help the patient.

I am for these systems (and I have been involved in building them in the past), but the feeding frenzy of business interests that will obviously get involved with them is the thing we should be yelling and screaming about, not short-term accuracy issues.
> It is pretty clear that initial accuracy issues will become less and less of a problem as these technologies mature.
What do you base this on?
As someone who can see both the amazing things genAI can do and how utterly flawed most genAI output is, it's not obvious to me.
I'm working with Claude every day, Opus 4.7, and reviewing a steady stream of PRs from coworkers who are all-in, not just using it because of a corporate mandate like me, and I find an unending stream of stupidity and incomprehension from these bots that just astonishes me.
Claude recently output this to me:
"I've made those changes in three files:
- File 1
- File 2"
That is a vintage hallucination that could've come right out of GPT-2.
> That is a vintage hallucination that could've come right out of GPT-2.
That's because, despite the many claims to the contrary, the models haven't actually gotten any smarter. They are still just token prediction engines at the end of the day, without any understanding of what they are doing. That's why one should not rely on them.
> It is pretty clear that initial accuracy issues will become less and less of a problem as these technologies mature.
Is it?
Actually, yes. I have watched this specific industry mature from the very first fully automated note and have kept tabs on it since. The accuracy has increased massively and continues to increase, due to several factors:
- Speech recognition and frontier models keep getting better at handling these kinds of conversations across accents, languages, and specialties. The trend here is clear: compare GPT 4 with Opus 4.7 and there is no contest. I'd even take GPT 5.4 nano over GPT 4 right now. So, yeah, they have been improving, and, yeah, they will keep improving.
- The pipelines these models are built into are getting much more sophisticated than just 'transcribe with X and have GPT XX clean it up'. The people building these things aren't standing still. Even if they kept using the same models, the pipeline improvements alone would make things better over time. Add in the model improvements and the gains are even greater.
- The companies doing this work are seeing more and more edge cases, and data matters. More and more practitioners are using these things, which means more to learn from. It also means more stories of things going wrong: if you cut your error rate in half but grow your customer base 10x, you will hear about 5x the problems (quick arithmetic in the sketch after this list). We are seeing that right now.
- Providers are starting to adjust to the technology (repeating areas they know may cause trouble, adjusting their audio setups, etc.). Just like with any technology, both sides shift, and that matters. The first users were champions. The second wave was a mix of champions, haters, and people who didn't care yet. Now people are really starting to count on this technology. They know it isn't a fad and isn't going away, and they're actually using it day to day to get their work done. That means they adjust to it as needed to get to the next patient/note/etc.
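To spell out the reporting arithmetic from the edge-case point (toy numbers, purely illustrative): complaint volume scales with error rate times usage, so halving the rate while growing 10x still multiplies the complaints you hear about.

    # Hypothetical baseline: 10% of encounters produce an error report.
    error_rate, customers = 0.10, 1_000
    reports_before = error_rate * customers              # 100 reports

    # Cut the error rate in half, grow the customer base 10x.
    reports_after = (error_rate / 2) * (customers * 10)  # 500 reports

    print(reports_after / reports_before)                # 5.0 -> "5x the problems"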
This stuff is just a few years old and the gains are obvious and massive. They aren't going to suddenly stop improving. There is an argument that they will asymptotically approach some level of utility, but we are still gaining quickly right now.
60% is a normal human mistake rate? You can't be serious.