Comment by freedomben

18 hours ago

> Despite having access to my weight, blood pressure and cholesterol, ChatGPT based much of its negative assessment on an Apple Watch measurement known as VO2 max, the maximum amount of oxygen your body can consume during exercise. Apple says it collects an “estimate” of VO2 max, but the real thing requires a treadmill and a mask. Apple says its cardio fitness measures have been validated, but independent researchers have found those estimates can run low — by an average of 13 percent.

There's plenty of blame to go around for everyone, but at least for some of it (such as the above) I think the blame more rests on Apple for falsely representing the quality of their product (and TFA seems pretty clearly to be blasting OpenAI for this, not others like Apple).

What would you expect the behavior of the AI to be? Should it always assume bad data or potentially bad data? If so, that seems like it would defeat the point of having data at all as you could never draw any conclusions from it. Even disregarding statistical outliers, it's not at all clear what part of the data is "good" vs "unreliable", especially when the company that collected that data claims that it's good data.

FWIW, Apple has published validation data showing the Apple Watch's estimate is within 1.2 mL/kg/min of a lab-measured VO2 max.

Behind the scenes, it's using a pretty cool algorithm that combines deep learning with physiological ODEs: https://www.empirical.health/blog/how-apple-watch-cardio-fit...
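
Very roughly, and purely as a toy illustration of the general pattern (not Apple's actual model; the class, features, and the simple heart-rate ODE below are all made up for the sketch), the "deep learning + physiological ODE" idea is to have a network predict person-specific parameters of a simple dynamics equation and then fit the combination end to end against observed data:

  import torch
  import torch.nn as nn

  class HybridHRModel(nn.Module):
      """Toy hybrid: a small net predicts parameters of a heart-rate ODE."""
      def __init__(self, n_features: int):
          super().__init__()
          # Softplus keeps the two ODE parameters (recovery rate k, gain a) positive.
          self.param_net = nn.Sequential(nn.Linear(n_features, 16), nn.ReLU(),
                                         nn.Linear(16, 2), nn.Softplus())

      def forward(self, features, hr0, intensity, dt=1.0, hr_rest=60.0):
          k, a = self.param_net(features).unbind(-1)
          hr, trace = hr0, []
          # Euler-integrate dHR/dt = -k * (HR - HR_rest) + a * intensity(t)
          for u in intensity.unbind(-1):
              hr = hr + dt * (-k * (hr - hr_rest) + a * u)
              trace.append(hr)
          return torch.stack(trace, dim=-1)

  model = HybridHRModel(n_features=4)
  pred = model(torch.randn(8, 4), torch.full((8,), 70.0), torch.rand(8, 120))
  print(pred.shape)  # (8, 120): one predicted heart-rate trace per person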

  • The trick with the VO2 max measurement on the Apple Watch, though, is that the person cannot waste any time during their outdoor walk and needs to maintain a brisk pace.

    Then there are confounders like altitude and elevation gain that can sully the numbers.

    It can be pretty great, but it needs a bit of control in order to get a proper reading.

  • The paper itself: https://www.apple.com/healthcare/docs/site/Using_Apple_Watch...

    Seems like Apple's 95% accuracy estimate for VO2 max holds up.

      Thirty participants wore an Apple Watch for 5-10 days to generate a VO2 max estimate. Subsequently, they underwent a maximal exercise treadmill test in accordance with the modified Åstrand protocol. The agreement between measurements from Apple Watch and indirect calorimetry was assessed using Bland-Altman analysis, mean absolute percentage error (MAPE), and mean absolute error (MAE).
    
      Overall, Apple Watch underestimated VO2 max, with a mean difference of 6.07 mL/kg/min (95% CI 3.77–8.38). Limits of agreement indicated variability between measurement methods (lower -6.11 mL/kg/min; upper 18.26 mL/kg/min). MAPE was calculated as 13.31% (95% CI 10.01–16.61), and MAE was 6.92 mL/kg/min (95% CI 4.89–8.94).
    
      These findings indicate that Apple Watch VO2 max estimates require further refinement prior to clinical implementation. However, further consideration of Apple Watch as an alternative to conventional VO2 max prediction from submaximal exercise is warranted, given its practical utility.
    

    https://pmc.ncbi.nlm.nih.gov/articles/PMC12080799/

    • That’s saying that they’re 95% confident that the mean measurement is lower than the treadmill estimate, not that the watch is 95% accurate. In other words they’re confident that the watch underestimates VO2 max.
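
      A quick back-of-the-envelope check reproduces that reading from the quoted numbers, assuming the standard Bland-Altman formulas and the n = 30 stated in the abstract:

        from math import sqrt

        mean_diff = 6.07   # mL/kg/min, Watch minus treadmill (underestimate)
        upper_loa = 18.26  # reported upper Bland-Altman limit of agreement
        n = 30             # participants

        sd = (upper_loa - mean_diff) / 1.96  # implied SD of the differences
        se = sd / sqrt(n)                    # standard error of the mean difference
        t_crit = 2.045                       # two-sided 95% t value, df = 29
        lo, hi = mean_diff - t_crit * se, mean_diff + t_crit * se
        print(f"SD ~ {sd:.2f}, 95% CI for mean difference ~ ({lo:.2f}, {hi:.2f})")
        # -> roughly (3.75, 8.39), matching the reported 3.77-8.38. Because 0 is
        # outside that interval, the claim is "confident systematic underestimate",
        # not "95% accurate".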

> I think the blame more rests on Apple for falsely representing the quality of their product

There was plenty of other concerning stuff in that article. And from a quick read it wasn't suggested or implied that the VO2 max issue was the deciding factor in the original F score the author received. The article did suggest many times over that ChatGPT is really not equipped for the task of health diagnosis.

> There was another problem I discovered over time: When I tried asking the same heart longevity-grade question again, suddenly my score went up to a C. I asked again and again, watching the score swing between an F and a B.

  • > There was plenty of other concerning stuff in that article.

    Yeah, for sure. I probably didn't make it clear enough, but I do fault OpenAI for this as much as or maybe more than Apple. I didn't think that needed to be stressed since the article is already blasting them for it, and I don't disagree with most of that criticism of OpenAI.

  • The lack of self-consistency does seem like a sign of a deeper issue with reliability. In most fields of machine learning robustness to noise is something you need to "bake in" (often through data augmentation using knowledge of the domain) rather than get for free in training.
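
    A minimal sketch of what that baking-in usually looks like, as a toy and not any particular vendor's pipeline: perturb the training features with the kinds of errors you expect from the sensor (noise, dropped readings), so the model can't lean on them being exact.

      import numpy as np

      rng = np.random.default_rng(0)

      def augment(features, noise_frac=0.05, dropout_prob=0.1):
          """Return a perturbed copy of one feature vector.

          noise_frac:   relative Gaussian noise, mimicking estimation error
          dropout_prob: chance a reading is missing (e.g. watch off wrist)
          """
          noisy = features * (1 + noise_frac * rng.standard_normal(features.shape))
          keep = rng.random(features.shape) > dropout_prob
          return np.where(keep, noisy, np.nan)  # NaN marks a dropped reading

      # Made-up example: [VO2 max estimate, resting HR, brisk-walk HR]
      print(augment(np.array([38.0, 62.0, 118.0])))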

Well, if it doesn't know the quality of the data, and especially if it would be dangerous to guess, then it should probably say it doesn't have an answer.

  • I don't disagree, but that reinforces my point above I think. If AI has to assume the data is of poor quality, then there's no point in even trying to analyze it. The options are basically:

    1. Trust the source of the data to be honest about its quality

    Or

    2. Distrust the source

    Approach number 2 basically means we can never do any analysis on it.

    Personally I'd rather have a product that might be wrong than none at all, but that's a personal preference.

> Should it always assume bad data or potentially bad data? If so, that seems like it would defeat the point of having data at all as you could never draw any conclusions from it.

Yes. You, and every other reasoning system, should always challenge the data and assume it’s biased at a minimum.

This is better described as “critical thinking” in its formal form.

You could also call it skepticism.

That impossibility of drawing conclusions assumes there’s a correct answer and is called the “problem of induction.” I promise you a machine is better at avoiding it than a human.

Many people freeze up or fail with too much data. Put someone with no experience in front of 500 people to give a speech if you want to watch this live.

  • I mostly agree with you, but I think it's important to consider what you're doing with the data. If we're doing rigorous science, or making life-or-death decisions on it, I would 100% agree. But if we're an AI chatbot trying to offer some insight, with a big disclaimer that "these results might be wrong, talk to your doctor," then I think that's quite overkill. The end result would be no (potential) insight at all and no chance of ever improving, since we'll likely never get to a point where we could fully trust the data. Not even the best medical labs are always perfect.

> What would you expect the behavior of the AI to be? Should it always assume bad data or potentially bad data? If so, that seems like it would defeat the point of having data at all as you could never draw any conclusions from it.

Well, I would expect the AI to provide the same response that a real doctor would give from the same information, which the article showed the doctors were able to do.

I also would expect the AI to provide the same answer every time for the same data, unlike what it did (swinging from an F to a B over multiple attempts in the article).
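
For what it's worth, API callers can at least pin the sampling settings to narrow those swings; here is a sketch assuming the openai Python client (the consumer ChatGPT app exposes neither knob, and even this is only best-effort reproducibility, not a guarantee):

  from openai import OpenAI

  client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
  resp = client.chat.completions.create(
      model="gpt-4o",
      temperature=0,  # suppress sampling randomness as much as the API allows
      seed=42,        # best-effort deterministic sampling, not guaranteed
      messages=[{"role": "user",
                 "content": "Grade my heart longevity from this data: ..."}],
  )
  print(resp.choices[0].message.content)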

OpenAI is entirely to blame here when they are putting out faulty products (hallucinations, even on accurate data, are their fault).

I have been sitting and waiting for the day these trackers get exposed as just another health fad that is optimized to deliver shareholder value and not serious enough for medical-grade applications.

  • I don't see how they are considered a health fad; they're extremely useful and accurate enough. There are plenty of studies and real-world data showing Garmin VO2 max readings coming within 1-2 points of a real-world test.

    There is this constant debate about how accurately VO2 max is measured, and it's highly dependent on actually doing exercise to determine your VO2 max using your watch. But yes, if you want a lab/medically precise measure, you need a test that measures your actual oxygen uptake.