Comment by eis
6 hours ago
The Elo rating system measures a model's performance relative to the other models. As the other models improve, or rather as newer, better models enter the list, the Elo score of an existing model will tend to decrease even though there may be no change whatsoever to the model or its system prompt.
You can't use Elo scores to measure decay of a model's performance in absolute terms. For that you need a fixed harness running over a fixed set of tests.
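To make the "fixed harness, fixed tests" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the pass/fail lists are made up, and the two-proportion z-test treats the two runs as independent samples, which is a simplification when the same tests are rerun.

```python
import math

def pass_rate(results):
    """Fraction of tests passed; `results` holds one boolean per fixed test."""
    return sum(results) / len(results)

def two_proportion_z(pass_a, n_a, pass_b, n_b):
    """z-statistic for the change in pass rate from run A to run B (pooled SE)."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se if se > 0 else 0.0

# Hypothetical outcomes on the SAME 200 tests, run on two different dates.
baseline = [True] * 160 + [False] * 40   # 80% pass
later = [True] * 140 + [False] * 60      # 70% pass

z = two_proportion_z(sum(baseline), len(baseline), sum(later), len(later))
print(f"baseline={pass_rate(baseline):.2f} later={pass_rate(later):.2f} z={z:.2f}")
```

Because the test set and harness stay fixed, a drop here can only come from the model side (or from noise), not from stronger competitors joining a leaderboard.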
The relative and auto-scaling nature of Elo ranking feels like an advantage here.
Relative ranking systems extract more information per tournament. You will get something approximating the actual latent skill level with enough of them.
Advantage for what exactly, though? I'm not saying Elo ranking doesn't give any information. It just doesn't give the information that the OP's project claims to give: that models get nerfed over time. You could extract this kind of information from the raw results of each evaluation round between two models, ignoring any new model entries, and compare those results over time. You can't extract it from the resulting Elo scores when the list of models is ever-changing.
New models are on average better than older models, so the average skill of the population of models increases over time. Because an Elo score is relative to that population, you are mathematically guaranteed that any existing model will degrade in Elo score over time even though it didn't change in any way.
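A rough sketch of the raw-results approach: track the head-to-head win rate of one fixed pair per period and ignore every match that involves a newcomer. The log format, model names, and outcomes below are hypothetical.

```python
from collections import defaultdict

def pair_win_rate_by_period(matches, model, opponent):
    """Win rate of `model` against one fixed `opponent`, grouped by period.

    `matches` yields (period, model_a, model_b, winner) tuples. Matches that
    involve any other model are ignored; ties (winner is None) are skipped.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for period, a, b, winner in matches:
        if {a, b} != {model, opponent} or winner is None:
            continue
        games[period] += 1
        if winner == model:
            wins[period] += 1
    return {p: wins[p] / games[p] for p in sorted(games)}

# Hypothetical match log; names and outcomes are made up.
log = [
    ("2025-01", "model-x", "model-y", "model-x"),
    ("2025-01", "model-y", "model-x", "model-x"),
    ("2025-01", "model-x", "model-new", "model-new"),  # ignored: newcomer
    ("2025-02", "model-x", "model-y", "model-y"),
    ("2025-02", "model-x", "model-y", "model-x"),
]
rates = pair_win_rate_by_period(log, "model-x", "model-y")
print(rates)
```

One caveat: a falling pairwise win rate only points at `model-x` if you have reason to believe `model-y` itself stayed the same.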
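The drift is easy to reproduce in a toy simulation. This is a sketch under stated assumptions, not how any particular leaderboard computes its scores: plain zero-sum Elo, round-robin play, newcomers entering at a default of 1500, and each game scored with the true win probability instead of a sampled win/loss so the run is deterministic. All skill values are invented.

```python
def expected(r_a, r_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def play_round(ratings, skills, k=16.0):
    """One round-robin; every pair meets once, ratings update simultaneously."""
    deltas = [0.0] * len(ratings)
    for i in range(len(ratings)):
        for j in range(i + 1, len(ratings)):
            actual = expected(skills[i], skills[j])       # score from fixed true skill
            predicted = expected(ratings[i], ratings[j])  # score the ratings predict
            change = k * (actual - predicted)
            deltas[i] += change   # zero-sum: what i gains, j loses
            deltas[j] -= change
    return [r + d for r, d in zip(ratings, deltas)]

def settle(ratings, skills, rounds=2000):
    """Play enough rounds for the ratings to stop moving."""
    for _ in range(rounds):
        ratings = play_round(ratings, skills)
    return ratings

START = 1500.0
skills = [1500.0, 1400.0, 1600.0]   # index 0 is the tracked model; its skill never changes
ratings = settle([START] * 3, skills)
history = [ratings[0]]

for newcomer_skill in (1700.0, 1800.0, 1900.0):   # each newcomer is stronger
    skills.append(newcomer_skill)
    ratings.append(START)                          # newcomers enter at the default rating
    ratings = settle(ratings, skills)
    history.append(ratings[0])

print([round(r) for r in history])
```

Under these assumptions the tracked model's rating settles near 1500, then near 1450, 1400 and 1350 as each stronger newcomer arrives, although its own skill is held constant throughout.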
It's like benchmarking a model against a list of challenges that over time are made more and more difficult and then claiming the model got nerfed because its score declined.
Elo is good at establishing an overall ranking order across models but that's not what this is about.
To detect nerfing of a model, projects like https://marginlab.ai/trackers/claude-code/ are much better suited (I'm not affiliated in any way).
Is that strictly true? Elo ratings do also inflate over time (looking at you, chess GMs).
Elo systems often include one or more ways for new points to enter the system. The system used by the European Go Federation has three, iirc: (1) a rating cannot go under 100, (2) a player cannot lose more than 100 points in one tournament, and (3) a weaker player beating a stronger one (which is countered by the stronger player beating the weaker one, but it's not balanced: if two people only play each other forever and ever, both of their ratings will grow).
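The rating floor is the easiest of these to illustrate. The sketch below uses the plain Elo update with a floor of 100 bolted on; it is not the EGF's actual formula, and the K-factor and starting ratings are arbitrary assumptions.

```python
FLOOR = 100.0

def expected(r_a, r_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_with_floor(winner, loser, k=32.0):
    """Plain Elo update for one decisive game, with the loser clamped at FLOOR."""
    gain = k * (1.0 - expected(winner, loser))
    return winner + gain, max(FLOOR, loser - gain)

a, b = 200.0, 105.0          # hypothetical ratings; b sits just above the floor
total_before = a + b
a, b = update_with_floor(a, b)
total_after = a + b

print(f"injected points: {total_after - total_before:.2f}")
```

The winner collects the full gain while the loser is stopped at the floor, so the pool's total grows. Without the clamp the update would be exactly zero-sum.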