Comment by jiggawatts

4 hours ago

Many of the rows in that spreadsheet reference "current events", which models aren't expected to do much better at than a human making an educated guess! They all have cutoff dates either last year or early this year and know nothing about what happened in "April 2026".

This is doubly problematic because you evaluated earlier models like Gemini Pro 3 instead of 3.1, GPT 5.4 instead of 5.5, etc...

Given that it's only a thousand short questions, you should be able to re-run your test in about an hour with the latest models, so... why haven't you?

Similarly, LLM output is non-deterministic, so you could get more interesting stats from your data set by repeating each question 'n' times for each model.
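To make the "repeat each question 'n' times" idea concrete, here's a rough sketch of how the repeated verdicts could be summarized per question. Note that `ask_model` is a hypothetical placeholder (it just returns a random verdict here) and the "true"/"false" labels are my assumption, not something taken from the spreadsheet.

```python
import random
from collections import Counter


def ask_model(model: str, question: str, rng: random.Random) -> str:
    """Hypothetical stand-in for a real model call; returns a random verdict."""
    return rng.choice(["true", "false"])


def repeat_verdicts(model: str, question: str, n: int, rng: random.Random) -> list:
    """Ask the same question n times and collect every verdict."""
    return [ask_model(model, question, rng) for _ in range(n)]


def summarize(verdicts: list) -> dict:
    """Return the majority verdict and the share of runs that agreed with it."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    majority, top_count = Counter(verdicts).most_common(1)[0]
    return {"majority": majority, "consistency": top_count / len(verdicts)}
```

A question where consistency sits near 0.5 is one where the model is effectively guessing, which seems more informative than a single sampled answer.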

Two of the models used have retrieval capabilities and have access to newer information through search. The other three are parametric.

  • Comparing models with search tools to models without - when there's no option for "I am unable to answer this question without access to search" - doesn't make sense to me.

    • Agree about comparing models with and without search capabilities. Even the two models with search capabilities (Sonar Pro and Gemini) agree only on 58% of the claims.

  • The title mentions "fact-checks", but "fact checking" is a process in which facts are checked against sources, not one where you are given a random fact and have to tell if it's true or false from your own memory. That's what is normally called a quiz game. So a more honest title for this research would be "Models answer differently to quiz questions".
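For reference, the 58% agreement figure mentioned in the replies above is presumably a plain pairwise agreement rate, i.e. the fraction of claims on which two models gave the same verdict. A minimal sketch of that calculation, assuming one verdict per claim per model in matching order (the function name and label format are my own, not from the study):

```python
def agreement_rate(verdicts_a: list, verdicts_b: list) -> float:
    """Fraction of claims on which two models returned the same verdict."""
    if len(verdicts_a) != len(verdicts_b):
        raise ValueError("both models must have one verdict per claim")
    if not verdicts_a:
        raise ValueError("need at least one claim")
    matches = sum(a == b for a, b in zip(verdicts_a, verdicts_b))
    return matches / len(verdicts_a)
```

With binary verdicts, two models guessing independently would already agree about half the time, so the raw rate probably needs to be read against that baseline.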