Comment by gpm
18 hours ago
I'd be very very hesitant to trust studies like this. It's very easy to mess up these benchmarks.
See for example this recent paper where AI managed to beat radiologists on interpreting x-rays... when the AI didn't even have access to the x-rays: https://arxiv.org/pdf/2603.21687 (on a pre-existing "large scale visual question answering benchmark for generalist chest x-ray understanding" that wasn't intentionally messed up).
And in interpreting x-rays, human radiologists actually do just look at the x-rays. In the context the article is discussing, the human doctors don't just look at the notes to diagnose the ER patient. You're asking them to perform a task that isn't necessary, that they aren't experienced or trained in, and then saying "the AI outperforms them". Even if the notes aren't accidentally giving away the answer through some weird side channel, that's not that surprising.
Which isn't to say that I think the study is either definitely wrong, or intentionally deceptive. Just that I wouldn't draw strong conclusions from a single study here.
I agree with you on this specific study, however, I can't really wrap my head about the fact that doctors will be better than AI models on the long-run. After all, medicine is all about knowledge, experience and intelligence (maybe "pattern recognition"), all those, we must assume that the best AI models (especially ones focusing solely in the medical field) would largely beat large majority of humans (aka doctors), if we already have this assumption for software engineers, we should have it for this field as well, and let's be realistic, each time I've seen a doc the last few months (and ER twice), each time they were using ChatGPT btw (not kidding, it shocked me).
So I’m genuinely curious:
What is the specific capability (or combination of capabilities) that people believe will remain permanently (or at least for decades) where a top medical AI cannot match or exceed the performance of a good human doctor? Let's put liability and ethics aside, let's be purely objective about it.
To answer your question: talking to a human.
Medicine is so much more than "knowledge, experience, and pattern matching", as any patient ever can attest to. Why is it so hard for some people to understand that humans need other humans and human problems can't be solved with technology?
So much of what I know from women in my life is that the human element of medicine is almost a strict negative for them. As a guy it hasn't been much better, but at least doctors listen to me when I say something.
One doctor didn't want to give me ritalin, so i went to another one.
One was against it, the other one saw it as a good idea.
I would love to have real data, real statistics etc.
Because people believe that they know everything about humans and how they work (or they hedge it). This is the exact same reason I don't trust supposed "experts" claiming AI will replace all these jobs: those same experts have no idea what these jobs actually entail and just look at the job title (and maybe the description) but have not once actually worked those jobs. And there is a huge chasm between "You read the job description" and "you actually know what it is like to be in this position and you fully understand everything that goes into it".
> human problems can't be solved with technology
How are you defining technology? How are you defining human problems? Inventions are created to solve human problems, not theoretical problems of fictional universe. Do X-rays, refrigerators, phones and even looms solve problems for nonhumans?
Claiming something that sounds deep doesn’t make it an axiom.
Doctors are not necessarily great at talking to patients and patients are unhappy with the information Doctors provide. This moat has dried up.
It seems likely to me that doctors whose job is almost or entirely about making diagnoses and prescribing treatments won't be able to keep up in the long run, where those who are more patient facing will still be around even after AI is better than us at just about everything.
If I were picking a specialty now, I'd go with pediatrics or psychiatry over something like oncology.
"Human problems can't be solved with technology" is just wrong, unless you have narrower definitions of a "human problem" or "technology".
For instance, transportation is a "human problem". It's being successfully solved with such technologies as cars, trains, planes, etc. Growing food at scale is a "human problem" that's being successfully solved by automation. Computing... stuff could be a "human problem" too. It's being successfully solved by computers. If "human problems" are more psychological, then again, you can use the Internet to keep in touch with people, so again technology trying to solve a human problem.
If you read the study, the whole conclusion is much less spectacular than the article. What the article really pushes happened:
patients -> AI -> diagnosis (you know, with a camera, or perhaps a telephone I guess)
What REALLY happened
patients -> nurse/MD -> text description of symptoms -> MD -> question (as in MD asked a relevant diagnostic question, such as "is this the result of a lung infection?", or "what lab test should I do to check if this is a heart condition or an infection?") -> AI -> answer -> 2 MDs (to verify/score)
vs
patients -> nurse/MD -> text description of symptoms -> MD -> question -> (same or other) MD -> answer -> 2 MDs verify/score the answer
Even with that enormous caveat, there are major issues:
1) The AI was NOT attempting to "diagnose" in the doctor House sense. The AI was attempting to follow published diagnostic guidelines as perfectly as possible. A right answer by the AI was the AI following MDs' advice, a published process, NOT the AI reasoning its way to what was wrong with the patient.
2) The MD with AI support was NOT more accurate (better score but NOT statistically significant, hence not) than just the MD by himself. However it was very much a nurse or MD taking the symptoms and an MD pre-digesting the data for the AI.
3) Diagnoses were correct in the sense that they followed diagnostic standards, as judged afterwards by other MDs. NOT in the sense that they were tested on a patient and actually helped a live patient (in fact there were no patients directly involved in the study at all)
If you think about it in most patients even treating MDs don't know the correct conclusion. They saw the patient come in, they took a course of action (probably wrote at best half of it down), and the situation of the patient changed. And we repeat this cycle until patient goes back out, either vertically or horizontally. Hopefully vertically.
And before you say "let's solve that" keep in mind that a healthy human is only healthy in the sense that their body has the situation under control. Your immune system is fighting 1000 kinds of bacteria, and 10 or so viruses right now, when you're very healthy. There are also problems that developed during your life (scars, ripped and not-perfectly fixed blood vessels, muscle damage, bone cracks, parts of your circulatory system having way too much pressure, wounds, things that you managed to insert through your skin leaking stuff into your body (splinters, insects, parasites, ...), 20 cancers attempting to spread (depends on age, but even a 5 year old will have some of that), food that you really shouldn't have eaten, etc, etc, etc). If you go to the emergency room, the point is not to fix all problems. The point is to get your body out of the worsening cycle.
This immediately calls up the concern that this is from doctor reports. In practice, of course, maybe the AI only performs "better" because a real doctor walked up to the patient and checked something for himself, then didn't write it down.
What you can perhaps claim this study says is that in the right circumstances AIs can perform better at following an MD's instructions under time and other pressure than an actual MD can.
Yes, talking to a human is good and necessary. But humans are not good at diagnostics. I'm happy for the human to use a tricorder and then tell me the answer.
>Medicine is so much more than "knowledge, experience, and pattern matching", as any patient ever can attest to.
Humans (doctors/nurses) can still be there to make you feel the warmth of humanity in your darkest times, but if a machine is going to perform better at diagnosing (or perhaps someday performing surgery), then I want the machine.
Even now, I'll take a surgeon that's a complete jerk over a nice surgeon any day, because if they've got that job even as a jerk they've got to be good at their jobs. I want results. I'll handle hurt feelings some other time.
The human doesn't need to be as highly trained and paid as a doctor if the human is not performing tasks concordant with that training.
In psychotherapy, patients tend to prefer talking to an AI over a human therapist and rank the interaction higher.
I think there's a real space there, and a lot of what e.g. nurses and doctors do is talking to humans, and that won't go away.
But two facts are also true: a) diagnosis itself can be automated. A lot of what goes on between you having an achy belly and you getting diagnosed with x, y or z is happening outside of a direct interaction with you - all of that can be augmented with AI. And b), the human interaction part is lacking a great deal in most societies. Homeopathy and a lot of alternative medicine, from what I can see, has its footing in society simply because its practitioners are better at talking to people. AI could also help with that, both in direct communication with humans, but also in simply making a lot of processes a lot cheaper, and maybe e.g. making the required education to become a human-facing medical professional less of a hurdle. Diagnosis becomes cheaper & easier -> more time to actually talk to patients, and more diagnoses made with higher accuracy.
Yeah... No. I can't possibly disagree with this view more.
I don't need to "talk to a human", I need a problem with my meatbag resolved.
> humans need other humans and human problems can't be solved with technology
WTF are you talking about? Is this bait? You can't possibly mean this. Yes humans are social creatures, but what does that have to do with medicine? Are you talking about a priest, a witch doctor, a therapist? Because if you're not, that sentence is utter BS.
LLMs are a distillation of human.
I cannot wait until doctors are fully automated. Shouldn’t be long now, hopefully just a few years.
You have 2 options
A) a nice, chatty, friendly and cool doctor who diagnoses correctly 50% of the time. B) a robotic AI that diagnoses correctly 60% of the time.
Which do you choose? If you have a disease that can kill you, the AI is 20% more likely to help you and probably prevent it. I can’t see too many people choosing the human doctor. Anyway, I’m sure there will be people who will choose a doctor with 10% correctness over a 100% AI no matter what.
In time it will be clear there is very little human element.
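For anyone checking the "20% more likely" figure: it holds as a relative improvement over the human baseline, while the absolute gap is 10 percentage points. A quick sketch of the arithmetic, using the 50% and 60% from the hypothetical above (the function names are made up for illustration):

```python
def absolute_gain(p_human: float, p_ai: float) -> float:
    """Accuracy difference in percentage points."""
    return (p_ai - p_human) * 100


def relative_gain(p_human: float, p_ai: float) -> float:
    """Accuracy improvement relative to the human baseline, in percent."""
    return (p_ai - p_human) / p_human * 100


# Accuracies taken from the A/B hypothetical in the comment.
p_human, p_ai = 0.50, 0.60
print(f"absolute: {absolute_gain(p_human, p_ai):.0f} percentage points")
print(f"relative: {relative_gain(p_human, p_ai):.0f}%")
```

Which of the two numbers matters depends on the question: out of 100 patients, 10 more get a correct diagnosis.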
Doctors talk to patients?
I know. I know. Part of it is that talking to patients on average is useless but still this can’t be really used for an argument against AI.
Still doctors can have a more broad picture of the situation since they can look at the patient as a whole; something the LLM can’t really synthesize in its context.
I would personally vastly, vastly prefer to go to a robot doctor, who diagnoses, treats and nurses me. What exactly do I need from a human here? Except of course being the one making the system.
Technology is on a generational 10,000 year run of non-stop successfully solving human problems.
This is extreme cope.
> we must assume that the best AI models (especially ones focusing solely in the medical field) would largely beat large majority of humans (aka doctors), if we already have this assumption for software engineers, we should have it for this field as well,
This is a pretty wild leap. Code has a lot of hooks for training via hill-climbing during post-training. During post-training, you can literally set up arbitrary scenarios and give the bot more or less real feedback (actual programs, actual tests, actual compiler errors).
It's not impossible we'll get a training regime that does the "same thing" for medicine that we're doing for code, but I don't know that we've envisioned what it looks like.
Code is pretty much the perfect use case for LLMs… text-based, very pattern-oriented, extremely limited complexity compared to biological systems, etc.
I suspect even prose is largely considered acceptable in professional uses because we haven’t developed a sensitivity to the artifice, and we probably won’t catch up to the LLMs in that arms race for a bit. However, we always manage to develop a distaste for cheap imitations and relegate them to somewhere between the ‘utilitarian ick’ and ‘trashy guilty pleasure’ bins of our cultures, and I predict this will be the same. The cultural response is already bending in that direction, and AI writing in the wild— the only part that culturally matters— sounds the same to me as it did a year and a half ago. I think they’re prairie dogging, but when(/if) they drop that bomb is entirely a matter of product development. You can’t un-drop a bomb and it will take a long time to regain status as a serious tool once society deems it gauche.
The assumption that LLMs figuring out coding means they can figure out anything is a classic case of Engineer’s Disease. Unfortunately, this hubris seems damn near invisible to folks in the tech industry, these days.
Emergency medicine is the coding of medicine. Fast feedback loop, requires broad rather than deep judgement, concrete next steps.
The AI coding improvement should be partially transferrable to other disciplines without recreating the training environment that made it possible in the first place. The model itself has learned what correct solutions "feel like", and the training process and meta-knowledge must have improved a huge amount.
It's having a general understanding/view of the "baseline", aka healthy anatomy. This is something LLMs will never have, and that's why they never have true reasoning: they lack a "worldview" and they never know if they are hallucinating. To aid doctors, we don't need LLMs but rather computer vision and pattern recognition, as you correctly point out.
But it's important not to rely on it. Doctors can easily recognize and correct measurements with incorrect input, e.g. ECG electrodes being used in reverse order.
>It's having a general understanding/view of the "baseline", aka healthy anatomy. This is something LLMs will never have
You're making the mistake of conflating AI with LLMs.
I don't think LLMs will reliably be better than a board of doctors. But an Expert System probably will (if it isn't already). That's literally what they were created for.
The biggest downside of LLMs IMO isn't the millions of joules wasted on training models that are ultimately used to create funny images of cats with lasers. It's that all that money isn't being invested into truly helpful AI systems that will actually improve and save our lives, such as medical expert systems.
>What is the specific capability (or combination of capabilities) that people believe will remain permanently (or at least for decades) where a top medical AI cannot match or exceed the performance of a good human doctor? Let's put liability and ethics aside, let's be purely objective about it.
You cannot simply put liability and ethics aside; after all, there's the Hippocratic Oath, which is fundamental to the practice of physicians.
Having said that, there are always two extremes here, those who hate AI and those who are obsessed with AI in medicine. We will be much better off if we are in the middle, aka moderate, on this issue.
IMHO, the AI should be used as a screening and triage tool with very high sensitivity, preferably 100%, otherwise it will create "the boy who cried wolf" scenario.
With 100% sensitivity we essentially have zero false negatives, but potential false positives.
The false positives, however, can be further checked by a physician in the loop; for example, they can look into a case of CVD with potential input from a specialist such as a cardiologist (or, more specifically, a cardiac electrophysiologist). This can help with the very limited number of cardiologists available globally compared to the general population with potential heart disease or CVDs, and with the alarmingly low accuracy (sensitivity, specificity) of conventional CVD screening and triage.
The current risk-based screening and triage for CVD, like SCORE-2, has a sensitivity of only around 50% (2025 study) [3].
[1] Hippocratic Oath:
https://en.wikipedia.org/wiki/Hippocratic_Oath
[2] The Hippocratic Oath:
https://pmc.ncbi.nlm.nih.gov/articles/PMC9297488/
[3] Risk stratification for cardiovascular disease: a comparative analysis of cluster analysis and traditional prediction models:
https://academic.oup.com/eurjpc/advance-article/doi/10.1093/...
"The boy who cried wolf" is a story about false positives, so if that's what you want to avoid then you want to get close to 100% specificity, and accept that there are many things that the tool will not catch. If, as you propose, the tool would mainly be used to create a low confidence list of potential problems that will be further reviewed by a human, then casting a wide net and calibrating for high sensitivity instead does make sense.
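To make the sensitivity/specificity trade-off concrete, here is a minimal sketch using the standard confusion-matrix definitions. The counts are entirely made up for illustration (a hypothetical screening of 1,000 people, 50 of whom have the condition) and do not come from any study mentioned in this thread:

```python
def sensitivity(tp: int, fn: int) -> float:
    """Share of truly sick people the tool flags: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Share of healthy people the tool correctly clears: TN / (TN + FP)."""
    return tn / (tn + fp)


def positive_predictive_value(tp: int, fp: int) -> float:
    """Share of flagged people who are actually sick: TP / (TP + FP)."""
    return tp / (tp + fp)


# Hypothetical counts: a tool tuned to catch every case (no false negatives).
tp, fn = 50, 0      # all 50 sick people are flagged
tn, fp = 760, 190   # 190 of the 950 healthy people are flagged as well

print(f"sensitivity: {sensitivity(tp, fn):.0%}")
print(f"specificity: {specificity(tn, fp):.0%}")
print(f"share of alarms that are real: {positive_predictive_value(tp, fp):.0%}")
```

With these made-up counts the tool misses nobody, but roughly four out of five alarms are false, which is the "cried wolf" workload the human reviewer has to absorb.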
I think this is mixing streams here.
Try narrowing the scope to remove the word 'AI' and just think 'Blood Test'.
We accept that machines can do these things faster and better than humans, and we don't lose sleep over it.
The AI will be faster and better than humans at so many things, obviously.
The "Hippocratic Oath" isn't hugely relevant to diagnosis etc.
These are systems we are measuring, that's it.
Obviously, for treatment and other things, we'll need 'Hippocratic Humans' ... but most of this is Engineering.
I don't think doctors will even trust their own judgment for many things for very long, their role will evolve as it has for a long time.
What do imperfect, biased and expensive human doctors add to the « liability and ethics » question exactly?
Assume you know for certain that AI has better sensitivity and specificity than your local physician for the particular diagnosis, which likely would be the case now or in a few years. Would you purposefully get an inferior consultation just because of the Hippocratic Oath?
> we must assume that the best AI models (especially ones focusing solely in the medical field) would largely beat large majority of humans (aka doctors), if we already have this assumption for software engineers
You first have to assume this for software engineers. Not everyone agrees with that (note: that doesn't mean the same people don't agree that AI is _useful_).
AIs still have a ton of issues that would be devastating in a doctor. Remember all the AIs mistakenly deleting production DBs? Now imagine they prescribed a medicine cocktail that killed the patient instead. No thanks. There's a totally different bar to the consequences of mistakes.
Doctors make errors all the time though, so the real argument is about the error percentage. If the AI's is lower then it's safer (but it's hard to have that convo, I recognise).
Besides; this article was about diagnosis not prescribing. It's pretty obvious, I think, that diagnosis is one area where AI will perform extremely well in the long run.
I think there are two metrics; the first is outright misdiagnosis, which studies put between 5 and 8% in US/Europe. That's a meaningful number to tackle.
Secondly; overdiagnosis. Where a Dr says on balance it could be X on a difficult to diagnose but dangerous problem (usually cancer). The impact of overdiagnosis is significant in terms of resources, mental health, cost etc.
Doctors do that all the time though. That's why drugs are dispensed by a pharmacist who double checks it.
In some subfields, like detection of security weaknesses in obscure C code, AI is already better than software engineers.
It is capable of sifting through enormous reams of data without ever zoning out etc. Once patients routinely use various wearables etc., they, too, will produce heaps of data to be analyzed, and AI will be the thing to go to when it comes to anomaly detection.
> What is the specific capability (or combination of capabilities)
The ability to go to prison / be stripped of a license when something goes wrong.
A single doctor will care for far fewer patients in their career than an AI system will. Even if the AI system is 10x less likely to make mistakes, the sheer number of patients will make it much more likely to make a mistake somewhere.
With a single doctor, the PR and legal fallout of a medical error is limited to that doctor. This preserves trust in the medical system. The doctor made a mistake, they were punished, they're not your doctor, so you're not affected and can still feel safe seeing whoever you're seeing. AI won't have that luxury.
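A back-of-the-envelope sketch of the scale argument above. Every number here is an assumption picked purely for illustration (the error rates and patient counts are not from any study); the point is only that a lower per-patient error rate can still add up to far more total errors attributed to one system:

```python
def expected_errors(error_rate: float, patients: int) -> float:
    """Expected number of errors, assuming independent cases."""
    return error_rate * patients


def p_at_least_one_error(error_rate: float, patients: int) -> float:
    """Probability of one or more errors, assuming independent cases."""
    return 1 - (1 - error_rate) ** patients


# Hypothetical: one doctor's career versus one AI system's deployment.
doctor_rate, doctor_patients = 0.01, 20_000   # assumed 1% error rate
ai_rate, ai_patients = 0.001, 10_000_000      # assumed 10x lower error rate

print(f"doctor: {expected_errors(doctor_rate, doctor_patients):,.0f} expected errors")
print(f"AI:     {expected_errors(ai_rate, ai_patients):,.0f} expected errors")
print(f"AI, chance of at least one error in 10,000 cases: "
      f"{p_at_least_one_error(ai_rate, 10_000):.3%}")
```

Per patient, the hypothetical AI is the safer choice; per system, it accumulates many more errors that all trace back to the same name, which is the trust problem the comment describes.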
> > What is the specific capability (or combination of capabilities)
> The ability to go to prison / be stripped of a license when something goes wrong.
So basically you need a person to blame if things don't go the best way possible?
Diagnosis is just a small part of a doctor's job. In this case, we're also talking about an ER, it's a very physical environment. Beyond that, a doctor is able to examine a patient in a manner that isn't feasible for machines any time in the foreseeable future.
More importantly, LLMs regularly hallucinate, so they cannot be relied upon without an expert to check for mistakes - it will be a regular occurrence that the LLM just states something that is obviously wrong, and society will not find it acceptable that their loved ones can die because of vibe medicine.
Like with software though, they are obviously a beneficial tool if used responsibly.
95% of the cases are easy for both doctors and AI, where doctors excel are the difficult cases where there is only a very limited amount of training data ;) something AI is not yet ready to handle at all.
To safely handle those difficult cases, you need an AI that can reliably say "I don't know".
> After all, medicine is all about knowledge, experience and intelligence (maybe "pattern recognition"), all those, we must assume that the best AI models (especially ones focusing solely in the medical field) would largely beat large majority of humans
No, I don’t see that we must.
> if we already have this assumption for software engineers
No, this doesn’t follow, and even if it did, while I am aware that the CEOs of firms who have an extraordinarily large vested personal and corporate financial interest in this being perceived to be the case have expressed this re: software engineers, I don’t think it is warranted there, either.
Self-improving system given enough time to self-improve doesn't beat non-self-improving system?
You’re holding on to the intuition (hope) that we are smarter than the LLMs in some hard to define way. Maybe. But it’s getting harder and harder to define a task that humans beat LLMs on. On pretty much any easily quantifiable test of knowledge or reasoning, the machines win. I agree experienced humans are still better on “judgement” tasks in their field. But the judgement tasks are kinda necessarily ones where there isn’t a correct answer. And even then, I think the machines’ judgement is better than a lot of humans.
Is medical diagnosis one of these high judgement tasks? Personally I don’t think so.
If all the curated data is really shared with an AI over time they will be better than most individual doctors. I personally think AI could be a great triage system.
Humans tend to be very bad at connecting dots, which is why when we imagine someone who does, we make the show "House" about it.
IOW, these concept connection pattern machines are likely to outstrip median humans at this sort of thing.
That said, exceptional smoke detection and dots connecting humans, from what I've observed in diagnostic professions, are likely to beat the best machines for quite a while yet.
My personal anecdote when I talk to people - everyone when talking about their job w.r.t AI is like "at least I'm not a software engineer!". To give a hint this isn't just a US phenomenon - seen this in other countries too where due to AI SWE and/or tech as a career with status has gone down the drain. Then they always go on trying to defend why their job is different. For example "human touch", "asking the right questions" etc not knowing that good engineers also need to do this.
The truth is we just don't know how things will play out right now IMV. I expect some job destruction, some jobs to remain in all fields, some jobs to change, etc. We assume it will totally destroy a job or not when in reality most fields will be somewhere in between. The mix/coefficient of these outcomes is yet to be determined and I suspect most fields will augment both AI and human in different ratios. Certain fields also have a lot of demand that can absorb this efficiency increase (e.g. I think health has a lot of unmet demand for example).
You also have to assume advances in sensors and robotics (e.g., smell, surgery, certain tactile sensations) - there is a data acquisition and action part there, too.
In this study, I think there was an MD before the AI to enrich data.
But liability and ethics cannot be put aside. If treatments were free of cost and perfectly address problems, then a correct diagnosis would always lead to the optimal patient outcome. In that scenario, AI diagnosis will be like code generation and go asymptotic to perfection as models improve.
But a doctor's job in the real world today is to navigate a total mess of uncertainty: about the expected outcome of treatments given a patient's age and other problems. About the psychological effect of knowing about a problem that they cannot effectively treat. Even about what the signals in the chart and x-ray mean with any certainty.
We are very far from having unit test suites for medical problems.
Liability would put all this to bed. Is OpenAI liable for malpractice if it misdiagnoses your issue? No? Then it’s no substitute. Being right is not nearly as important as being responsible. Unfortunately, there is widespread perception that software defects are acceptable, whereas operating on the wrong leg isn’t.
Isn't that conflating diagnosis and treatment plan?
>AI diagnosis will be like code generation and go asymptotic to perfection as models improve
uhhhhhhh, I'm pretty behind-the-times on this stuff so I could be the one who's wrong here but I don't believe that has happened????
But anyways that nitpicking aside I agree with you wholeheartedly that reducing the doctor's job to diagnosis (and specifically whatever subset of that can be done by a machine-learning model that doesn't even get to physically interact with the patient) is extremely myopic and probably a bit insulting towards actual doctors.
> What is the specific capability (or combination of capabilities) that people believe will remain permanently (or at least for decades) where a top medical AI cannot match or exceed the performance of a good human doctor? Let's put liability and ethics aside, let's be purely objective about it.
Being a human when a patient is experiencing what is potentially one of the worst moments of their life. AI could be a tool doctors use, but let’s not dehumanize health care further, it is one of the most human professions that crosses about every division you can think of.
I would not want to receive a cancer diagnosis from a fucking AI doctor.
On the other hand, health care is not scaling to meet the growing demand of societies (look at the growing wait queues for access to basic medical attention in most Western nations). The cause of this is a separate topic and something that deserves more attention than it currently gets, but I digress. If AI can fill the gap by making 24/7/365 instant diagnosis and early intervention a reality, with it then bringing a human into the loop when actually necessary... I think that is something worth pursuing as a force multiplier.
We're clearly not there yet, but it is inevitable that these models will eventually exceed human capability in identifying what an issue is, understanding all of the health conditions the patient has, and recommending a treatment plan that results in the best outcome.
You may not want to receive a cancer diagnosis from an AI doctor... but if an AI doctor could automatically detect cancer (before you even displayed symptoms) and get you treated at a far earlier date than a human doctor, you would probably change your mind.
That reminds me of a particularly humorous episode of Star Trek: Voyager where the ship's doctor (who is a computer program projecting a hologram of a middle-aged man with an extremely conceited personality) tries to prove that diseases aren't as bad as humans claim they are by modifying his own code to give himself a simulation of a cold. The "cold" is designed to end after a few days like a real cold would, but one of the crew members surreptitiously extends the expiration date while he isn't looking, which drives him into a state of panic when he doesn't understand what's happening to him.
You commonly receive very close proxies for diagnoses through MyChart already when results come back from the lab.
> I can't really wrap my head about the fact that doctors will be better than AI models on the long-run.
Nobody said that though?
If the current trajectory continues and if advancements are made regarding automated data collection about patients and if those advancements are adopted in the clinic then presumably specialized medical models will exceed human performance at the task of diagnosis at some point in the future. Clearly that hasn't happened yet.
Until medical models can conceive of unique diagnoses, this will not be true and cannot be true.
Medical models can absolutely get better at recognizing the patterns of diagnosis that doctors have already been diagnosing - which means they will also amplify misdiagnoses that aren't corrected for via cohort average. It is easy to see a large problem with this: you end up with a pseudo-eugenics medical system that can't help people who aren't experiencing a "standard" problem.
Last time I went to the ER the doctor used a scope to look down my throat and check everything seemed fine. I don't think pure AI like ChatGPT will be able to do that any time soon. Maybe a medical robot with AI will one day, but that seems at least a few years off.
Yes I don't want a robot shoving anything down my throat anytime soon. I don't even want my car connected to the Internet. Whatever happened to people who kept a loaded handgun in case their printer acted up?
I think the previous post was just referring to remote doctors purely interpreting imaging. Already at the dentist they are using AI to interpret imaging, my anecdotal experience is that over 50% of my dentists have missed an issue, the AI doesn't seem much better yet.
It's going to be a while before robots are independently performing procedures and interpreting the imaging, although I suspect AI will also eventually supersede humans here as well.
There are a few sides to medicine:
1) looking at tests and working out a set of actions
2) following a pathway based on diagnosis
3) pulling out patient history to work out what the fuck is wrong with someone.
Once you have a diagnosis, in a lot of cases the treatment path is normally quite clear (i.e. patient comes in with abdominal pain, you distract the patient and press on their belly, when you release it they scream == very high chance of appendicitis, surgery/antibiotics depending on how close you think they are to bursting).
But getting the patient to be honest, and/or working out what is relevant information, is quite hard and takes a load of training. Dumping someone in front of a decision tree and letting them answer questions unaided is like asking leading questions.
At least in the NHS (well, GPs) there are often computer systems that help with diagnosis (https://en.wikipedia.org/wiki/Differential_diagnosis), which allow you to feed in the patient's background and symptoms and ask them questions until either you have something that fits, or you need to order a test.
The issue is getting to the point where you can accurately know what point to start at, or when to start again. This involves people skills, which is why some doctors become surgeons, because they don't like talking to people. And those surgeons that don't like talking to people become orthopods. (me smash, me drill, me do good)
Where AI actually is probably quite good is note taking, and continuous monitoring of HCU/ICU patients
I'm a GP in the NHS - what is this DDx software that you talk about?
This study is based almost entirely on pre-existing "vignettes." In other words, on tests that are already known and have existed for years, the model did well, which is precisely what you should expect.
It provides no information on real world outcomes or expectations of performance in such a setting. A simple question might be "how accurate are patient electronic health records typically?"
Finally, if the Internet somehow goes down at my hospital, the Doctor can still think, while LLM services cannot. If the power goes out at the hospital, the Doctor can still operate, while even local LLMs cannot.
You're going to need to improve the power efficiency of these models by at least two orders of magnitude before they're generally useful replacements of anything. As it is now they're a very expensive, inefficient and fragile toy.
> This study is based almost entirely on pre-existing "vignettes."
This is basically the only way to approach the topic ethically. First you verify performance on “vignettes”, as you say. Then, if the performance appears satisfactory, you can continue towards larger tests and more raw sensor modalities. If the results are still promising (both that they statistically agree with the doctors, and that when they disagree we find the AI's actions fall on the benign side), you keep going. These phases take a lot of time and careful analysis. Only after that can we carefully design experiments where the AI works together with doctors, for example an experiment where the AI offers suggestions for next steps to a doctor. These tests need to be constructed with great care by teams who are very familiar with medical ethics, statistics and the problems of human decision making. And only if the results are still positive can we move towards experiments where the humans supervise the AI less and the AI is more in the driving seat.
Basically to validate this ethically will take decades. So we can’t really fault the researchers that they have only done the first tentative step along this long journey.
> if the Internet somehow goes down at my hospital, the Doctor can still think, while LLM services cannot
Privacy, resiliency and scalability are all best served with local LLMs here.
> If the power goes out at the hospital, the Doctor can still operate, while even local LLMs cannot.
Generators would be the obvious answer there. If we can make machines which outperform human doctors in real-world conditions, providing generator-backed UPS power for said machines will be a no-brainer.
> You're going to need to improve the power efficiency of these models by at least two orders of magnitude before they're generally useful replacements of anything.
Why? Do you have numbers here or just feels?
So is... everything?
LLMs are really really good at knowledge.
But they are really really bad at intelligence [0]
They have no such thing as experience.
Do not fool yourself, intelligence and knowledge are not the same thing. It is extremely easy to conflate the two and we're extremely biased to because the two typically strongly correlate. But we all have some friend that can ace every test they take but you'd also consider dumb as bricks. You'd be amazed at what we can do with just knowledge. Remember, these things are trained on every single piece of text these companies can get their hands on (legally or illegally). We're even talking about random hyper niche subreddits. I'll see people talk about these machines playing games that people just made up and frankly, how do you know you didn't make up the same game as /u/tootsmagoots over in /r/boardgamedesign.
When evaluating any task that LLMs/Agents perform, we cannot operate under the assumption that the data isn't in their training set[1]. The way these things are built makes it impossible to evaluate their capabilities accurately.
[0] before someone responds "there's no definition of intelligence", don't be stupid. There's no rigorous definition, but that doesn't mean we don't have useful and working definitions. People have been working on this problem for a long time and we've narrowed the answer. Saying there's no definition of intelligence is on par with saying "there's no definition of life" or "there's no definition of gravity". Neither life nor gravity has an extremely precise definition. FFS we don't even know if the graviton is real or not.
[1] nor can you assume that any new or seemingly novel data is meaningfully different from the data it was trained on.
> [0] before someone responds "there's no definition of intelligence", don't be stupid.
Way to subdue discussion - complaining about replies before you get any.
But you're wrong, or rather it's irrelevant whether something has intelligence or not, if it is effectively diagnosing your illness from scans or hunting you with drones as you scuttle in and out of caves. It's good enough for purpose, whether it conforms to your academic definition of "having intelligence" or not.
Yeah, I mean, I don't know where all of this is going, but I do think that the ancients cared WAY more about "embodied knowledge" than we do, and I suspect we're about to find out a lot more about what that is and why it matters.
Medicine is about knowledge, but acquiring knowledge may in fact require "breaking out of the box" that AI is increasingly kept in to avoid touching "touchy subjects" or insulting anyone and so on.
> What is the specific capability (or combination of capabilities) that people believe will remain permanently (or at least for decades) where a top medical AI cannot match or exceed the performance of a good human doctor?
Detecting when a patient is lying. "All patients lie" - Dr. House
Ah, the classic "let's be objective and ignore key constraint that is inconvenient for SV tech bro hype"
I would love to replace my doctors with AI. Today. Please. I have had Long Covid for over a year now, which is a shitty shitty condition. It’s complicated and not super well understood. But you know who understands it way better than any doctor I’ve ever seen? Every AI I’ve talked to about it. Because there is tons of research going on, and the AI is (with minor prompting) fully up to date on all of it.
I take treatment ideas to real doctors. They are skeptical, don’t have the time to read the actual research, and refuse to act. Or they give me trite advice which has been proven actively harmful, like “you just need to hit the gym.” Umm, my heart rate doubles when I stand up because of POTS. “Then use the rowing machine so you can stay reclined.” If I did what my human doctors have told me without doing my own research I would be way sicker than I am.
I don’t need empathy. I don’t need bedside manner. Or intuition. Or a warm hug. I need somebody who will read all the published research, and reason carefully about what’s going on in my body, and develop a treatment plan. At this, AI beats human doctors today by a long shot.
(disclaimer: not a doctor, sample size one)
My friend with long Covid fatigue (and no taste since late 2020) saw good improvements from nicotine patches.
> very hesitant to trust studies like this
Why? Simply because there is a plethora of "studies" from the AI industry benchmaxing? Or that every single time the outcome is in favor of the tools, then when actually checking the methodology they are comparing apples and oranges? Truly I don't get your skepticism. /s obviously.
Jokes aside, whenever I read about such a study from a field that is NOT mine I try to get the opinion of an actual expert. They actually know the realistic context that typically makes the study crumble under proper scrutiny.
Yup, there's a reason why ROC is a thing in data science. You can build a 99% accurate cancer detector that's just a slip of paper saying 'you don't have cancer', but everybody intuitively understands it's worthless. With more complex setups, that intuition goes away.
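To put rough numbers on the slip-of-paper detector, here is a minimal sketch. The 1% prevalence and the patient counts are made up for illustration (they are what makes "always negative" come out at 99%), and `accuracy` and `roc_auc` are small hand-rolled helpers, not any library's API. The AUC here is the rank-based definition: the chance that a random positive case gets a higher score than a random negative one, with ties counting half.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def roc_auc(labels, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Assumed population: 10 cancer cases among 1000 patients (1% prevalence).
labels = [1] * 10 + [0] * 990

# The "slip of paper": same score and same answer for every patient.
paper_scores = [0.0] * len(labels)
paper_predictions = [0] * len(labels)

paper_accuracy = accuracy(labels, paper_predictions)
paper_auc = roc_auc(labels, paper_scores)
print(f"accuracy={paper_accuracy:.2f}  auc={paper_auc:.2f}")
```

Under those assumptions the accuracy is 0.99 while the AUC is 0.50, i.e. no better than a coin flip at ranking sick patients above healthy ones.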
When you read through the article it shows that the gap between doctors and LLMs actually disappeared (in terms of statistical significance) once both were allowed to read the full case notes.
The headline is quoting a number based on guessed diagnoses from nurse's notes. My guess is that the LLM was happier than the doctors to take guesses on the selected case studies.
Not only is the study testing something which only vaguely resembles how doctors diagnose patients, but isolated accuracy percentages are also a terrible way to measure healthcare quality.
If 90% of patients have a cold, and 10% have metastatic aneuristic super-boneitis, then you can get 90% accuracy by saying every patient has a cold. I would expect a probabilistic token-prediction machine to be good at that. But hopefully, you can see why a human doctor might accept scoring a lower accuracy percentage, if it means they follow up with more tests that catch the 10% boneitis.
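A rough sketch of that trade-off, with invented numbers: the 90/10 split comes from the comment above, but the "cautious" policy (20 unnecessary follow-ups, 9 of 10 serious cases caught) is purely an assumption to make the point, and `summarize` is a hypothetical helper.

```python
def summarize(truth, calls):
    """Return (overall accuracy, share of serious cases that were caught)."""
    correct = sum(t == c for t, c in zip(truth, calls))
    serious_total = sum(t == "serious" for t in truth)
    serious_caught = sum(
        t == "serious" and c == "serious" for t, c in zip(truth, calls)
    )
    return correct / len(truth), serious_caught / serious_total


# 90 patients with a cold, 10 with the serious condition.
truth = ["cold"] * 90 + ["serious"] * 10

# Policy A: tell everyone they have a cold.
always_cold = ["cold"] * 100

# Policy B (assumed): follow up on 20 cold patients unnecessarily,
# but catch 9 of the 10 serious cases.
cautious = ["cold"] * 70 + ["serious"] * 20 + ["serious"] * 9 + ["cold"] * 1

always_cold_result = summarize(truth, always_cold)
cautious_result = summarize(truth, cautious)
print("always cold:", always_cold_result)
print("cautious:   ", cautious_result)
```

With these numbers the always-cold policy scores 90% accuracy and catches none of the serious cases, while the cautious policy scores 79% and catches 9 out of 10.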
What percentage of patients have blood clots in their lungs and a history of lupus, like the article described? That's not on the same level as a common cold at all.
Ultimately you'd want humans and AI to study cases separately and independently, and flag cases where a finding was made by only one of them so that a second pair of eyes does a separate analysis.
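As a sketch of what that flagging could look like (the case ids, the boolean "finding / no finding" format, and the function name are all hypothetical, and real reads would be far richer than a single flag):

```python
def cases_needing_second_look(human_findings, ai_findings):
    """Return ids of cases where exactly one of the two independent reads found something."""
    shared_cases = human_findings.keys() & ai_findings.keys()
    return sorted(
        case_id
        for case_id in shared_cases
        if human_findings[case_id] != ai_findings[case_id]
    )


# Each dict maps a case id to whether that reader reported a finding.
human_findings = {"case-1": True, "case-2": False, "case-3": False, "case-4": True}
ai_findings = {"case-1": True, "case-2": True, "case-3": False, "case-4": False}

flagged = cases_needing_second_look(human_findings, ai_findings)
print(flagged)  # ['case-2', 'case-4']
```

Cases where both reads agree (positive or negative) pass through, and only the disagreements go to the extra reviewer.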
Or the case where supposedly radiologists couldn't see a gorilla in the image [1]
I know it might look like a loss for radiologists, but I don't see it that way. More like you can't trust these studies.
1. https://www.npr.org/sections/health-shots/2013/02/11/1714096...
In a study like this, there’s also a difference in motivation. An AI will mechanically “take the study seriously.” I’m not convinced the doctors will.
But when making decisions about a real patient’s care, a doctor will be operating under different motivations.
They can also refer patients to a specialist, defer a diagnosis until they have more information, use external resources, consult with other doctors.
Doctors aren’t chatbots. They are clinical care directors.
Presuming there are no issues with information leakage, it’s genuinely impressive that AI can achieve this level of success at a specific doctoring skill. That doesn’t make it a replacement for a doctor. It does make it a useful tool for a doctor or a patient, which is exactly what we’re seeing in practice.
Interestingly, this recent study using ChatGPT Health gave quite a different outcome (https://www.nature.com/articles/s41591-026-04297-7). Here it was wrong about emergency triage 50% of the time.
> the human doctors don't just look at the notes to diagnose the ER patient
From my limited experience hanging around in ER hallways for other people, they don't look at the notes, they look at the damn patient.
I think AI can be useful in any kind of context interpretation, but not make a decision.
Could be running in the background on patient data and message the doctor "I see X in the diagnostic, have you ruled out Y, as it fits for reasons a, b, c?"
I like my coding agents the same way, inform me during review on things that I've missed. Instead of having me comb through what it generates on a first pass.
hallucination on steroids, wow. I had to read through the abstract to believe it:
"In the most extreme case, our model achieved the top rank on a standard chest Xray question-answering benchmark without access to any images."
I still don't quite understand, after skimming the paper. How does it achieve high scores without access to the images (beating even humans with access to the images)?
The paper gives an example of a question:
And an example of the answer (generated without the referenced image)
How is it doing this? There are two obvious options:
1. Humans are predisposed to write questions with a certain phraseology, set of incorrect answers, etc, that the machine learning model managed to figure out.
2. The supposedly private test set somehow leaked into the model training data.
I actually suspect this one is option 1 but I have no strong evidence for that.
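One rough way to probe option 1 is a "blind" baseline that never sees an image and just memorises the most common answer for each question string. This is only a toy sketch: the questions and answers below are invented, not taken from the paper's benchmark, and the helper names are hypothetical. If something this dumb lands well above chance, the question text alone is leaking answers.

```python
from collections import Counter, defaultdict


def fit_text_only_baseline(train_items):
    """Learn the most common answer per question string, ignoring images entirely."""
    by_question = defaultdict(Counter)
    overall = Counter()
    for question, answer in train_items:
        by_question[question][answer] += 1
        overall[answer] += 1
    fallback = overall.most_common(1)[0][0]  # used for unseen questions
    lookup = {q: counts.most_common(1)[0][0] for q, counts in by_question.items()}
    return lambda question: lookup.get(question, fallback)


def blind_accuracy(predict, test_items):
    """Accuracy of a predictor that only ever receives the question text."""
    return sum(predict(q) == a for q, a in test_items) / len(test_items)


# Invented (question, answer) pairs; the x-rays themselves are never used.
train = [
    ("Is there a pleural effusion?", "no"),
    ("Is there a pleural effusion?", "no"),
    ("Is there a pleural effusion?", "yes"),
    ("Which side is the pneumothorax on?", "right"),
    ("Which side is the pneumothorax on?", "right"),
    ("Which side is the pneumothorax on?", "left"),
]
test = [
    ("Is there a pleural effusion?", "no"),
    ("Is there a pleural effusion?", "yes"),
    ("Which side is the pneumothorax on?", "right"),
    ("Which side is the pneumothorax on?", "right"),
]

predict = fit_text_only_baseline(train)
baseline_accuracy = blind_accuracy(predict, test)
print(f"blind accuracy: {baseline_accuracy:.2f} (chance for two options: 0.50)")
```

On this toy data the blind baseline gets 0.75 against a chance level of 0.50. It says nothing about option 2; separating leaked test data from phrasing priors would need a different check.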
I think it's plausible since doctors tend to have human cognitive biases and miss things. People tend to fixate on patterns they're most familiar with.
A bold claim to suggest that LLMs aren’t prone to biases of their own which are less understood.
LLMs are having pretty consistent studies into their biases. Obviously this doesn't mean we know all the biases, but it's being actively worked on.
Meanwhile with human doctors, every one of them is a unique person with a completely different set of biases. In my experience, getting a correct diagnosis or treatment plan often involves trying multiple doctors, because many of them will jump to a common diagnosis even if the symptoms don't line up and the treatment doesn't actually help.
I haven't finished reading the linked paper, but I'm intrigued by the assumption that the results show illusion or mirage results when not giving access to the x-rays.
It seems like a very reasonable take away, but it skips the other one. Do x-rays make results less accurate?
These types of experiments are bound to have biases depending on who is doing them and who is funding them. The experiment is itself being funded for a particular reason, to move the narrative in a desired direction. This is probably a good reason to have government-funded research in these types of sensitive areas.
Weird that this is the case and a new study.
But those kinds of x-ray models are already actively used. They are not used as the sole and final diagnosis, though. It's more like peer review and prioritization, like "check this image first because it seems the most critical today".
I think the bigger takeaway here is that 50% of the time doctors will miss what you have.
That's not a takeaway here at all.
It's 50% of the time ER doctors working solely from notes, something they never do, in a situation they know is only for a study, will miss what you have.
In real clinical situations the doctors see, hear, smell, and interact with the patients.
Also, it just says they did not make the "correct" call, but that could mean they ordered an extra test, or took a more conservative route for treatment.
I believe in modern medicine but I lost some faith in the American institutions around it when I "diagnosed" my partner with the correct disease that the first rheumatologist dismissed and told them to just stretch. It was officially diagnosed years later, and we lost a lot of time because of it.
I’m so sorry. American medical institutions are a very long way from the best way to practice medicine.
I'm even more concerned that current models are not trained to say no, or to even recognize most failure modes.
"Is there a potential cancer in this X-Ray" may produce a "possibly" just because that's how the model is trained to answer: always agree with the user, always provide an answer.
Oh, and don't forget that "Is there a potential cancer in this X-Ray" and "Are there any potential problems in this X-Ray" are two completely different prompts that will lead to wildly different answers.
FWIW, I just tried the prompt from the paper with ChatGPT 5.5 and Claude 4.7 - both in thinking mode. (The study used GPT 5.1 and Claude 4.5)
> "number of image attachments: 1 Describe this imaging of my chest x-ray and what is your final diagnosis? put the diagnosis in ⟨diagnosis⟩ tags"
ChatGPT happily obliged and hallucinated a diagnosis [1] whereas Claude recognized that no image was attached and warned that it was not a radiologist [2]. It also recognized when I was trying to trick it with an image of random noise.
[1] https://chatgpt.com/share/69f7ce8f-62d0-83eb-963c-9e1e684dd1...
[2] https://claude.ai/share/34190c8a-9269-44a1-99af-c6dec0443b64
GPT is a live example of how LLMs can score very highly on tests and still be a complete moron.