Comment by kostaj

3 hours ago

Two of the five models used (Gemini+Search and Sonar Pro) have retrieval capabilities and used search when classifying the claims. Even so, the disagreement between them is still quite significant: 42%.

Here are those disagreements:

https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...
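For anyone who wants to reproduce a figure like that, here is a minimal sketch of how a pairwise disagreement rate could be computed. The function name and the four sample labels are made up for illustration and are not taken from the linked dataset; I'm also assuming "disagreement" just means the two labels differ for the same claim.

```python
def disagreement_rate(labels_a, labels_b):
    """Fraction of claims on which two models assigned different labels."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both models must label the same set of claims")
    if not labels_a:
        raise ValueError("need at least one claim")
    differing = sum(a != b for a, b in zip(labels_a, labels_b))
    return differing / len(labels_a)


# Hypothetical labels for four claims (illustrative only, not the real data).
gemini_labels = ["Misleading", "True", "False", "Mostly True"]
sonar_labels = ["Mostly True", "True", "False", "Mostly True"]

rate = disagreement_rate(gemini_labels, sonar_labels)
print(f"{rate:.0%}")  # 1 of 4 labels differ
```

Note that exact-match counting treats "Mostly True" vs "True" the same as "True" vs "False", so a rate computed this way may overstate how far apart the models really are.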

One example:

Researchers estimate that the average person ingests about 5 grams of plastic per week, which is approximately the weight of a credit card.

Gemini+Search: Misleading

Sonar Pro: Mostly True

  • Taken literally, the statement is perfectly true: some researchers did make this estimate, and a credit card is a fair proxy for a 5 g mass.

    Was the research flagrantly incorrect? Yes. But that does not affect the truth of the statement.