Comment by nimih

3 years ago

I track ratings and player records for a local tabletop board game league, and the question of how to choose and implement a rating system ends up being pretty interesting, with a lot of literature to read if you start following citations.

Even if you have well-defined, sequential 2-player matches, where a widely-used model[0] exists in the literature, there is a wide variety of ways to estimate player ratings from game results, each with its own assumptions and tuning parameters.[1] If your domain also includes team-based or multiplayer matches (or some other weird feature that you want to account for), you then get to decide whether you want to try and hack together something using Elo because it'll be "close enough"[5], or whether you want to try and use (or build!) a more sophisticated system which captures those nuances, such as Microsoft's TrueSkill(TM) system[6].

[0] The so-called "Bradley-Terry model" https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model
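For concreteness, here's a minimal sketch of how that model turns a rating gap into a win probability, plus the usual Elo-style incremental update built on top of it. The function names, the 400-point scale, and K=32 are illustrative assumptions (common chess conventions), not anything taken from the article.

```python
def bt_win_probability(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """P(A beats B) under Bradley-Terry, written in Elo's base-10 logistic form.

    Equivalent to gamma_a / (gamma_a + gamma_b) with gamma = 10 ** (rating / scale).
    The 400-point scale is an assumed convention.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one game.

    score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for a loss.
    k is a tuning parameter; 32 is just a commonly used value.
    """
    expected_a = bt_win_probability(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    # Zero-sum update: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta
```

With these assumptions, two equally rated players each have a 50% expected score, so a win moves the winner up by k/2 and the loser down by the same amount.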

[1] Beyond Elo, which is described in the article, the options range from Glicko/Glicko-2[2], which still make incremental updates to participants' ratings after each match but try to track rating uncertainty and player growth in a more sophisticated way, to systems like Edo[3] and Whole-History Rating[4], which attempt to find the maximum-likelihood rating for all players at all points in time simultaneously.
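To give a flavor of the "fit everyone at once" end of that spectrum, here's a rough sketch of a batch maximum-likelihood fit for a static Bradley-Terry model, using the classic iterative (minorization-maximization) update. This is a deliberate simplification: it is not Edo or Whole-History Rating, both of which also model ratings changing over time. The function name and the (winner, loser) input format are assumptions made for illustration.

```python
from collections import defaultdict


def fit_bradley_terry(matches, iterations=200):
    """Estimate static Bradley-Terry strengths from a list of (winner, loser) pairs.

    Returns {player: strength}, normalised to sum to 1, so that
    P(i beats j) = strength[i] / (strength[i] + strength[j]).
    Caveat: a finite maximum-likelihood estimate only exists if the results are
    "connected" enough (roughly, nobody is undefeated or winless against the rest).
    """
    wins = defaultdict(int)        # total wins per player
    pair_games = defaultdict(int)  # games played between each unordered pair
    players = set()
    for winner, loser in matches:
        wins[winner] += 1
        pair_games[frozenset((winner, loser))] += 1
        players.update((winner, loser))

    strength = {p: 1.0 for p in players}
    for _ in range(iterations):
        updated = {}
        for p in players:
            denom = 0.0
            for q in players:
                if q == p:
                    continue
                n = pair_games.get(frozenset((p, q)), 0)
                if n:
                    denom += n / (strength[p] + strength[q])
            updated[p] = wins[p] / denom
        # Strengths are only defined up to a constant factor, so renormalise.
        total = sum(updated.values())
        strength = {p: s / total for p, s in updated.items()}
    return strength
```

Unlike an incremental update, this looks at every result at once, so the order the games were played in doesn't matter. That's also its main limitation for a real league, where players improve over time.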

[2] https://en.wikipedia.org/wiki/Glicko_rating_system

[3] http://www.edochess.ca/

[4] https://www.remi-coulom.fr/WHR/WHR.pdf

[5] This is (obviously) the approach taken in the article, and IMO is probably the right answer unless you're a huge nerd who's interested in wasting a ton of time for not-much practical benefit.

[6] https://en.wikipedia.org/wiki/TrueSkill