Not the original commenter, but I work in the space. We have large annotated datasets with "gold" evidence that we want to retrieve, so the evaluation of new models is actually very quantitative.
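To make that concrete, here's a minimal sketch of the kind of metric this involves. All names and data here are hypothetical illustrations, not our actual pipeline:

    # Minimal sketch: retrieval evaluation against annotated "gold" evidence.
    # Everything here (IDs, dataset, retriever) is hypothetical.

    def recall_at_k(retrieved: list[str], gold: set[str], k: int) -> float:
        """Fraction of gold evidence docs found in the top-k results."""
        if not gold:
            return 0.0
        hits = sum(1 for doc_id in retrieved[:k] if doc_id in gold)
        return hits / len(gold)

    # Annotated dataset: each query is paired with its gold evidence doc IDs.
    dataset = [
        {"query": "q1", "gold": {"doc_3", "doc_7"}},
        {"query": "q2", "gold": {"doc_1"}},
    ]

    def evaluate(retrieve, dataset, k=10):
        """Average recall@k over the annotated dataset for a given retriever."""
        scores = [recall_at_k(retrieve(ex["query"]), ex["gold"], k) for ex in dataset]
        return sum(scores) / len(scores)

    def dummy_retriever(query: str) -> list[str]:
        # Stand-in for the model under test; returns ranked doc IDs.
        return ["doc_3", "doc_2", "doc_7"]

    print(evaluate(dummy_retriever, dataset, k=3))  # 0.5

Swap in a new model as the retriever and the comparison is a single number over the same annotations.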
> but I work in the space
Ya, the original commenter likely does not work in the space - hence the ask.
> the evaluation of new models is actually very quantitative.
While you may be able to derive a % correct (and hence a quantitative score), these evaluations are by their nature not quantitative: grading Q&A on written subjects is subjective. Example benchmark: https://llm-stats.com/benchmarks/gpqa. And even though there are techniques to reduce overfitting to the benchmark, it isn't eliminated. So it remains subjective.
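To be clear about what I mean: deriving the % correct is the mechanical part. A hypothetical sketch (not how any particular benchmark's harness works):

    # The "% correct" on a multiple-choice benchmark is just accuracy,
    # a single mechanical computation over keyed answers.

    def accuracy(predictions: list[str], answers: list[str]) -> float:
        """Percentage of questions where the model picked the keyed answer."""
        correct = sum(p == a for p, a in zip(predictions, answers))
        return 100.0 * correct / len(answers)

    print(accuracy(["A", "C", "B"], ["A", "B", "B"]))  # ~66.7

The number is quantitative; what's subjective is which answer got keyed as correct when the questions were written.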