Comment by Intralexical

14 hours ago

Not only is the study testing something that only vaguely resembles how doctors diagnose patients, but isolated accuracy percentages are also a terrible way to measure healthcare quality.

If 90% of patients have a cold, and 10% have metastatic aneuristic super-boneitis, then you can get 90% accuracy by saying every patient has a cold. I would expect a probabilistic token-prediction machine to be good at that. But hopefully you can see why a human doctor might accept a lower accuracy percentage if it means they follow up with more tests that catch the 10% with boneitis.
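The base-rate arithmetic above can be sketched in a few lines. All numbers here are made up for illustration ("cold"/"rare" labels and the 900/100 split are hypothetical, not from the study): a classifier that always says "cold" scores higher accuracy than one that orders extra tests and catches every rare case.

```python
# Illustrative only: 1,000 hypothetical patients, 900 with a cold,
# 100 with a rare serious condition.
patients = ["cold"] * 900 + ["rare"] * 100

# Classifier A: always predicts "cold".
pred_a = ["cold"] * 1000

# Classifier B: follows up with tests, mislabels 150 colds as suspicious,
# but catches every rare case.
pred_b = ["rare"] * 150 + ["cold"] * 750 + ["rare"] * 100

def accuracy(pred):
    # Fraction of predictions matching the true label.
    return sum(p == t for p, t in zip(pred, patients)) / len(patients)

def rare_cases_caught(pred):
    # How many truly rare cases were flagged as rare (sensitivity, in counts).
    return sum(p == "rare" for p, t in zip(pred, patients) if t == "rare")

print(accuracy(pred_a), rare_cases_caught(pred_a))  # 0.9, 0   — higher accuracy, misses everyone
print(accuracy(pred_b), rare_cases_caught(pred_b))  # 0.85, 100 — lower accuracy, misses no one
```

The "better" classifier by raw accuracy is the one that never catches the dangerous condition, which is exactly why a single accuracy percentage is a poor yardstick here.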

What percentage of patients have blood clots in their lungs and a history of lupus, like the article described? That's not on the same level as a common cold at all.

  • > One experiment focused on 76 patients who arrived at the emergency room of a Boston hospital.

    > In one case in the Harvard study, a patient presented with a blood clot to the lungs and worsening symptoms.

    That's a single anecdotal fluke from the study, which is misleadingly used to represent the headlining percentages.

    If you read the linked paper, it says the LLMs did not outperform any group of doctors in the most important cases:

    > The median proportion of cannot-miss diagnoses included for o1-preview was 0.92 [interquartile range (IQR) 0.62 to 1.0], although this was not significantly higher than GPT-4, attending physicians, or residents.

    And again, the bigger issue is that skimming nurses' notes and predicting the next tokens — which is effectively what the study made the doctors do — is not how doctors diagnose medical conditions.

    • But that's not what I was responding to. "Oh, all of the cases are probably just common colds, so it just guessed cold and was right by sheer luck" is not what happened in the article.