Comment by nestorD
1 day ago
On alternative ways to measure LLM intelligence, we had good success with this: https://arxiv.org/abs/2509.23510
In short: start with a dataset of questions, where each question has been answered by two different LLMs. Ask the model you want to evaluate to choose the better answer for each pair. Then measure how consistently it selects winners: does it reliably favor some models across questions, or does it behave close to randomly? This consistency is a strong proxy for the model's intelligence.
It is not subject to dataset leaks, lets you measure intelligence in many fields where you might not have golden answers, and converges quickly, which makes it really cheap to run.
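To make the procedure concrete, here is a minimal sketch (this is not the exact metric from the paper; `judge_pick` and the deviation-from-chance score below are placeholders for whatever judge prompt and consistency measure you actually use):

```python
# Sketch of the pairwise-consistency idea.
# Assumptions: judge_pick(question, answer_a, answer_b) calls the model under
# evaluation and returns "a" or "b"; the consistency score is simply how far
# each model-pair's win rate deviates from the 50% a random judge would give.
from collections import defaultdict

def consistency_score(dataset, judge_pick):
    """dataset: iterable of dicts with keys
    question, answer_a, answer_b, model_a, model_b."""
    wins = defaultdict(int)    # (model_a, model_b) -> wins for model_a
    totals = defaultdict(int)  # (model_a, model_b) -> comparisons seen

    for item in dataset:
        pair = (item["model_a"], item["model_b"])
        choice = judge_pick(item["question"], item["answer_a"], item["answer_b"])
        totals[pair] += 1
        if choice == "a":
            wins[pair] += 1

    # Average absolute deviation from chance (0.5), rescaled to [0, 1]:
    # ~0 means the judge behaves randomly, ~1 means it always picks one side.
    deviations = [abs(wins[p] / totals[p] - 0.5) * 2 for p in totals]
    return sum(deviations) / len(deviations)
```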
Interesting, but couldn't a model "cheat" in this task by being very good at telling model outputs apart? How far do you get with a classifier simply trained to distinguish models by their output?
It seems to me that many models, perhaps by design, have a recognizable style, which would be much easier to detect than it would be to evaluate the factual quality of the answers.
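For concreteness, a baseline for that check could be as simple as the following (field names are made up, and this only measures stylistic separability, not answer quality):

```python
# Train a bag-of-words classifier to guess which LLM wrote a given answer,
# ignoring correctness entirely. High held-out accuracy would mean the
# outputs are easy to tell apart by style alone.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def style_classifier_accuracy(answers, model_labels):
    """answers: list of answer strings; model_labels: which LLM produced each."""
    X_train, X_test, y_train, y_test = train_test_split(
        answers, model_labels, test_size=0.2, random_state=0)
    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                        LogisticRegression(max_iter=1000))
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)
```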
In theory, yes! If this metric ever becomes a widely used standard, one would have to start accounting for that...
But in practice, when asked to pick the best answer, the model sees a single question with its pair of answers and focuses on determining which one it thinks is best.
Doesn't that presume that one model dominates the other?
It presumes that some models are better than others (and we do find that providing data with a wide mix of model strengths improves convergence), but it does not require a single dominant model, and the preferences do not even need to be transitive.