Comment by jll29

2 years ago

> Dan at least started to provide actual evidence and criteria by which he would score results, but even he only looked at 5 examples. Which really is a small sample size to make any general claims.

US NIST, in its annual TREC evaluation of search systems in the scientific/academic world, uses sets of 25 or 50 queries (confusingly called "topics" in the jargon).

For each topic, retired intelligence analysts search a mandated document collection to find (almost) all relevant results. These are represented by document IDs for general search, and by regular expressions that match the relevant answers for question answering (when that was evaluated, 1998-2006).
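A toy sketch of how such regex answer patterns can drive scoring. Everything here is invented for illustration (the topic ID, the patterns, and the helper name are not real TREC data or tooling); the idea is just that a system answer counts as correct if any judged pattern matches it:

```python
import re

# Hypothetical answer patterns per topic, in the spirit of TREC QA
# judgments: each topic maps to regexes that match acceptable answers.
answer_patterns = {
    "Q42": [r"\bMount Everest\b", r"\bEverest\b"],  # invented example topic
}

def is_correct(topic_id, system_answer):
    """Return True if the system's answer matches any judged pattern."""
    return any(re.search(p, system_answer)
               for p in answer_patterns.get(topic_id, []))

# Score a (made-up) run of system answers for the topic.
run = ["The tallest mountain is Mount Everest.", "K2"]
accuracy = sum(is_correct("Q42", a) for a in run) / len(run)
```

Because the judgments are patterns rather than exact strings, the same file can score any future system's output, which is what makes the collection reusable.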

Such an approach is expensive, but it has the advantage that the resulting relevance judgments are reusable for evaluating future systems.