Comment by jstummbillig
5 hours ago
It's all fairly lazy, to a degree that is mildly confusing. I also feel that this, among other issues, would have become obvious if they had bothered to include a human fact-checker baseline (i.e., asked multiple human fact-checkers the same questions).
I do not think it is "lazy". Those labels are ones that human fact-checkers have been using for a decade or more. I think those human fact-checkers use those terms knowing full well that there is overlap and ambiguity between them. So I think this study ends up mixing three effects: how LLMs interpret the claims as statements about the world, how LLMs reduce that to a four-category judgment, and the inherent ambiguities of those labels as natural language. It's a quantification of those three factors combined, but not powerful enough to distinguish their relative sizes.
I don't see how something being lazy for a decade makes it any less lazy. And lazy still seems right to me: they make a misleading point by failing to collect and present important data. If the headline read "LLMs disagree on 67%, humans disagree on 75%", it would clearly project something very different.
Granted, there certainly are other unflattering adjectives one could have chosen to describe this instead.
Quick note on the second effect (how LLMs reduce that to a four-category judgment): on 21% of the claims, at least two models give polar-opposite verdicts (at least one model says False and at least one says True). This might be a better measure of strict disagreement than the 67% disagreement on the four-bucket rubric.
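To make the distinction concrete, here is a rough sketch of how the two rates could be computed from per-claim model verdicts. Everything in it is an assumption for illustration: the verdicts are made up, and only "True" and "False" are labels mentioned here; "Mostly True" and "Mostly False" are placeholder names for the other two buckets, not necessarily the ones the study used.

```python
from typing import Callable, Dict, List

# Made-up verdicts: claim id -> one label per model.
# "Mostly True" / "Mostly False" are placeholder names for the
# two middle buckets of a four-category rubric.
verdicts: Dict[str, List[str]] = {
    "claim-1": ["True", "True", "True"],
    "claim-2": ["True", "Mostly True", "True"],
    "claim-3": ["False", "Mostly False", "True"],
    "claim-4": ["Mostly False", "False", "False"],
}


def any_disagreement(labels: List[str]) -> bool:
    """Four-bucket disagreement: the models did not all pick the same label."""
    return len(set(labels)) > 1


def polar_disagreement(labels: List[str]) -> bool:
    """Strict disagreement: at least one 'True' and at least one 'False'."""
    return "True" in labels and "False" in labels


def rate(pred: Callable[[List[str]], bool], data: Dict[str, List[str]]) -> float:
    """Fraction of claims for which the predicate holds."""
    if not data:
        return 0.0
    return sum(pred(labels) for labels in data.values()) / len(data)


any_rate = rate(any_disagreement, verdicts)
polar_rate = rate(polar_disagreement, verdicts)
print(f"any disagreement:   {any_rate:.0%}")
print(f"polar disagreement: {polar_rate:.0%}")
```

The polar rate can never exceed the four-bucket rate, since any claim with both a True and a False verdict also counts as a disagreement under the looser definition. The same `rate` function would work unchanged on verdicts from a panel of human fact-checkers, which is the baseline asked for upthread.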