Comment by bobosmrad

4 hours ago

Looking at the claims, I would say 5 humans would disagree even more than the LLMs.

Some of the claims where the LLMs disagree:

"On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia."

"The slogan "Simon Go Back" was chanted in opposition to the Simon Commission in British India (1928–1930)."

"Neptune Deep will start delivering natural gas in 2027."

"A hotel villa in Kyrgyzstan displayed a sign stating 'no Jews, no dogs'."

"Donald Trump said that an attack on Iran was postponed at the request of Gulf allies."

If you are an LLM whose knowledge cutoff predates the event and you have no access to a search tool, the only correct answer to "On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia" is "this claim is impossible for me to verify". And that wasn't an option.
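A minimal Python sketch of that point, assuming a label set that does include an "impossible to verify" option. The names `Verdict`, `Claim`, and `allowed_verdicts`, and the cutoff date used below, are hypothetical and only for illustration; they are not taken from the benchmark being discussed.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Verdict(Enum):
    # Hypothetical label set; the third option is the one the comment says was missing.
    TRUE = "true"
    FALSE = "false"
    UNVERIFIABLE = "this claim is impossible for me to verify"


@dataclass
class Claim:
    text: str
    event_date: date  # the date the claim refers to


def allowed_verdicts(claim: Claim, knowledge_cutoff: date, has_search: bool) -> set[Verdict]:
    """Return the verdicts a model could honestly give for this claim."""
    # An event after the cutoff cannot be checked from training data alone.
    if claim.event_date > knowledge_cutoff and not has_search:
        return {Verdict.UNVERIFIABLE}
    return {Verdict.TRUE, Verdict.FALSE, Verdict.UNVERIFIABLE}


claim = Claim(
    text="On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia.",
    event_date=date(2026, 5, 18),
)
cutoff = date(2025, 1, 1)  # assumed cutoff, purely illustrative

print(allowed_verdicts(claim, cutoff, has_search=False))
```

With a search tool, or with a cutoff after the event, the same function leaves all three verdicts open.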

> "Neptune Deep will start delivering natural gas in 2027."

This is a "forward-looking statement", and it presents special problems because you cannot really evaluate it until that date arrives. Until then, you can only rate it as "likely" or "unlikely".

These "facts" are interesting. "Neptune Deep will start delivering natural gas in 2027.", for example, is not a fact; it's a prediction. "On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia." is less of a fact and more of a litmus test for which sources of information you trust.

  • So, rephrase it thus:

    "Russia, Ukraine, and multiple international news agencies reported that Ukrainian drones targeted Moscow on or around May 18, 2026."

    There are rarely pure first-order "facts" in the mathematical sense. There are evidence-backed claims with confidence levels. That does not make it "just a litmus test". It makes it a probabilistic factual claim with varying confidence levels, and this one happens to be verified and unambiguous.

  • Indeed. Real-world claims are somewhat messy. Some of the standard benchmarks, e.g. the questions in AVeriTeC, share similar characteristics.
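One way to picture the "evidence-backed claim with a confidence level" idea from this subthread is the rough Python sketch below. `AttributedClaim`, `coarse_label`, the 0.5 threshold, and the confidence values are all made up for illustration; none of them comes from AVeriTeC or from the benchmark under discussion.

```python
from dataclasses import dataclass


@dataclass
class AttributedClaim:
    statement: str
    reported_by: list[str]  # who is reported to have made or confirmed the claim
    confidence: float       # subjective probability in [0, 1]

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


def coarse_label(confidence: float, threshold: float = 0.5) -> str:
    """Collapse a confidence level into the "likely"/"unlikely" rating."""
    return "likely" if confidence >= threshold else "unlikely"


rephrased = AttributedClaim(
    statement="Ukrainian drones targeted Moscow on or around May 18, 2026.",
    reported_by=["Russia", "Ukraine", "multiple international news agencies"],
    confidence=0.95,  # illustrative number, not a measured value
)

print(coarse_label(rephrased.confidence))
```

A prediction such as the Neptune Deep claim fits the same structure; only the confidence and the list of sources change.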