Comment by Aurornis
17 hours ago
When you read through the article, it shows that the gap between doctors and LLMs actually disappeared (in terms of statistical significance) once both were allowed to read the full case notes.
The headline is quoting a number based on guessed diagnoses from nurses' notes. My guess is that the LLM was more willing than the doctors to venture guesses from the selected case studies.
Not only is the study testing something that only vaguely resembles how doctors diagnose patients, but isolated accuracy percentages are also a terrible way to measure healthcare quality.
If 90% of patients have a cold and 10% have metastatic aneuristic super-boneitis, you can get 90% accuracy by saying every patient has a cold. I would expect a probabilistic token-prediction machine to be good at that. But hopefully you can see why a human doctor might accept a lower accuracy percentage if it means they follow up with more tests that catch the 10% with boneitis.
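To make the base-rate point concrete, here's a toy sketch (hypothetical numbers only, nothing to do with the study's actual cohort) showing how a degenerate "always say cold" strategy scores 90% accuracy while catching none of the serious cases:

```python
# Toy illustration of the base-rate problem above (made-up numbers):
# always predicting the common diagnosis looks great on accuracy
# but catches zero of the rare, serious cases.

truth = ["cold"] * 90 + ["serious"] * 10
always_cold = ["cold"] * len(truth)

accuracy = sum(p == t for p, t in zip(always_cold, truth)) / len(truth)
serious_caught = sum(
    p == t for p, t in zip(always_cold, truth) if t == "serious"
)

print(f"accuracy: {accuracy:.0%}")                      # -> accuracy: 90%
print(f"serious cases caught: {serious_caught} of 10")  # -> 0 of 10
```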
What percentage of patients have blood clots in their lungs and a history of lupus, as the article described? That's not on the same level as a common cold at all.
> One experiment focused on 76 patients who arrived at the emergency room of a Boston hospital.
> In one case in the Harvard study, a patient presented with a blood clot to the lungs and worsening symptoms.
That's a single anecdotal fluke from the study, misleadingly used to stand in for the headline percentages.
If you read the linked paper, it says the LLMs did not outperform any group of doctors in the most important cases:
> The median proportion of cannot-miss diagnoses included for o1-preview was 0.92 [interquartile range (IQR) 0.62 to 1.0], although this was not significantly higher than GPT-4, attending physicians, or residents.
And again, the bigger issue is that skimming nurses' notes and making next-token-style guesses, which is what the study made the doctors do, is not how doctors diagnose medical conditions.