As far as I understand, this is exactly how Elo scores work. If a more capable model shows up and starts beating all the other models, it literally takes Elo points from everyone else.
https://en.wikipedia.org/wiki/Elo_rating_system
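To make the "takes points from everyone else" part concrete, here is a minimal sketch of the classic pairwise Elo update from the Wikipedia article. The function names and the K-factor of 32 are my own choices for illustration, and this is not necessarily how the chart in question computes its scores.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Expected score of player A against player B (logistic curve, scale 400)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Return the new (r_a, r_b) after one game.

    score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for a loss.
    k=32 is an assumed K-factor, chosen only for illustration.
    """
    delta = k * (score_a - expected_score(r_a, r_b))
    # Whatever A gains, B loses: the update is zero-sum per game.
    return r_a + delta, r_b - delta


# A newcomer and an incumbent both start at 1000; the newcomer wins.
newcomer, incumbent = elo_update(1000.0, 1000.0, score_a=1.0)
print(newcomer, incumbent)  # 1016.0 984.0
```

With equal K-factors the sum of the two ratings is unchanged after every game, so a model that keeps winning can only climb by pulling points out of its opponents.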
There is an instance of this in the chart: 2025-06-24, when Gemini-2.5-pro shows up. As you can see, the Elo of the others does not drop.
Depends on the test design: is an agent competing against another agent in a given match, or against a test? Plus, does the test's Elo fluctuate?
Yes, that is in fact how Elo can work[0]. There are quite a few ways Elo systems can work.
[0]: https://en.wikipedia.org/wiki/Elo_rating_system
It depends what you use as an anchor. If the anchor is a fixed model, you’re right. If the anchor is updated to a better model over time, then the Elo of historical models degrades, right?
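A rough sketch of the anchor effect, assuming the ratings are only meaningful as differences and the leaderboard pins one model to a fixed number. The model names, raw scores, and the `anchor_ratings` helper are all made up for illustration.

```python
def anchor_ratings(ratings: dict, anchor: str, anchor_value: float = 1000.0) -> dict:
    """Shift every rating by the same constant so that `anchor` sits at `anchor_value`."""
    shift = anchor_value - ratings[anchor]
    return {name: r + shift for name, r in ratings.items()}


# Hypothetical raw scores; only the gaps between them carry information.
raw = {"old": 0.0, "mid": 100.0, "new": 250.0}

print(anchor_ratings(raw, "old"))  # {'old': 1000.0, 'mid': 1100.0, 'new': 1250.0}
print(anchor_ratings(raw, "new"))  # {'old': 750.0, 'mid': 850.0, 'new': 1000.0}
```

The gaps between models are identical in both outputs. With a fixed anchor the older models keep their numbers when a stronger model is added; if the anchor moves to the stronger model, every historical number slides down by the same amount.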