Comment by Intralexical

3 hours ago

> One experiment focused on 76 patients who arrived at the emergency room of a Boston hospital.

> In one case in the Harvard study, a patient presented with a blood clot to the lungs and worsening symptoms.

That's a single anecdotal fluke from the study, which is misleadingly used to represent the headlining percentages.

If you read the linked paper, it says the LLMs did not outperform any group of doctors in the most important cases:

> The median proportion of cannot-miss diagnoses included for o1-preview was 0.92 [interquartile range (IQR) 0.62 to 1.0], although this was not significantly higher than GPT-4, attending physicians, or residents.

And again, the bigger issue is that skimming nurses' notes and predicting the next tokens, as the study made the doctors do, is not how doctors diagnose medical conditions.

But that's not what I was responding to. "Oh, all of the cases are probably just common colds, so it just guessed cold and was right by sheer luck" is not what happened in the article.

  • Do you know how examples work? Or methodology? The claim I made is that statistical accuracy percentage ≠ healthcare outcomes, and you will mislead yourself in dangerous ways if you believe a headline that implies they're interchangeable. Not that the model literally guessed common colds when the patients had... boneitis...

    The lupus anecdote on its own is irrelevant to whether the statistics are being interpreted in valid ways or not. Also, I said nothing about luck.