I think unfortunately Elo is misapplied here.
Elo is appropriate for chess, where there is no initial game-state variance, and no built-in advantage for either competitor except who goes first; that can be addressed by averaging, by using the results of tournaments where the competitors swap colors, or simply by maintaining a separate Elo as white and as black.
Similarly for Starcraft you can track Elo separately for Terran/Zerg/Protoss. (Technically you would also need to do the same by map, but anyway...)
With MTG, you have a huge effect from the quality of the deck. Unless you have each player play with each deck, there's no way to de-convolute the quality of the deck vs the quality of the player. And if you did have that data, Elo couldn't leverage it -- you'd need a more sophisticated model to account for that statistical effect.
Then there's the game-state variance you allude to... Regardless of how good you are at MTG, and even how good your deck is, you're going to lose a lot of games due to mana flood / mana screw / etc. When that happens to either player, the outcome of the game does not contain useful information about skill. Of course if you sample enough games, you can still figure out what is skill and what is chance, but using Elo with low-count datasets is bound to be misleading because it is designed for games of pure skill, where game outcomes contain information about relative skill levels 100% of the time. Maybe you could establish some rules about what games are appropriate to use as indicators of relative skill, and which ones must be discarded?
Anyway it's an interesting idea. Here's related reading for the MMR score used in Magic Arena:
https://hareeb.com/2021/05/23/inside-the-mtg-arena-rating-sy...
> With MTG, you have a huge effect from the quality of the deck. Unless you have each player play with each deck, there's no way to de-convolute the quality of the deck vs the quality of the player. And if you did have that data, Elo couldn't leverage it -- you'd need a more sophisticated model to account for that statistical effect.
How well a player chooses their deck is one of the factors that determines how good a player is. You can say the same thing about the other games: I'd probably have a better rating in chess if I didn't only play somewhat unsound gambits, and I'd definitely have a better rating in Starcraft if I didn't only do 2port wraith in TvZ.
It is very rock-paper-scissors, even at the top level.
I'm not sure if it's the same in Magic, but when I played Yu-Gi-Oh, how well built a deck was tended to just be an indicator of how much money you had.
I'm not so sure.
A deck is something you have. A build order, or a chess opening, is something you _know_ and therefore more or less what I'd be comfortable calling skill.
I actually came up with a play style variation that avoids the mana flood / screw and my son and I use it when we play. Honestly, I find it a lot more fun.
You split your deck into two stacks. One with land and one with everything else. For your starting hand, you take 3 land and 4 of everything else.
Each draw phase, you pick which stack you draw from.
That’s it. Everything else stays the same but mana floods / screws completely stop.
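If you want to put a number on how much opening-hand variance this removes, a quick Monte Carlo sketch makes it concrete. The deck size, land count, and the "0-1 lands is screwed / 5+ is flooded" thresholds below are just illustrative assumptions, not anything official:

    import random

    DECK_SIZE, LANDS, HAND = 60, 24, 7
    deck = [1] * LANDS + [0] * (DECK_SIZE - LANDS)   # 1 = land, 0 = spell

    def lands_in_normal_hand():
        """Shuffle the whole deck and draw 7 cards, as in regular Magic."""
        return sum(random.sample(deck, HAND))

    trials = 100_000
    hands = [lands_in_normal_hand() for _ in range(trials)]
    screwed = sum(h <= 1 for h in hands) / trials    # 0-1 lands in the opener
    flooded = sum(h >= 5 for h in hands) / trials    # 5+ lands in the opener

    print(f"normal draw:   P(0-1 lands) ~ {screwed:.3f}, P(5+ lands) ~ {flooded:.3f}")
    # Under the two-pile rule the opening hand is always exactly 3 lands and
    # 4 spells, so both probabilities are 0 by construction; the remaining
    # variance is only in *which* spells and lands you see.
    print("two-pile rule: P(0-1 lands) = 0, P(5+ lands) = 0")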
It’s good for teaching younger players who still have temper problems, but there’s only so much of the game you can experience this way. And don’t expect to get to advanced or expert strategies without the game balance falling apart.
One knock-on effect I'd predict is that higher-mana-value cards would be substantially more playable. I expect a deck of walls + counterspells + removal + big finishers like the Eldrazi or even just Baneslayer Angel to be much more effective than it is now.
On the other end of the spectrum, super low-to-the-ground aggro strategies also get a huge bonus by simply never having to draw a land again.
Storm decks (play a bunch of cheap spells, typically with a discount or with effects that give you mana when you cast a spell) probably get a huge boost as well, since they can ensure they never fizzle out. Once the engine is going they'll always win unless they get countered.
What loses out here are all the decks in the middle: the midrange, "fair" decks that are just trying to curve out with the best play each turn.
And all that’s not counting the rules headache with cards like Oracle of Mul-Daya, Fact or Fiction, Treasure Hunt, or Dark Confidant. Which pile does my Maze of Ith go in? Cultivate? Sol Ring? Faceless Haven?
With that said it’s also my personal opinion that variance just makes the game more enjoyable and widens the group of players you can compete against, as long as you have the emotional capacity to not take losses personally.
The variance argument may be solid in general but I will say that mana flood and mana screw can be greatly alleviated through deck building and use of mulligan.
You don't often see it happen during high level play.
I used to be rather careless in how I planned the mana of my decks and rarely took a mulligan. I faced mana issues all the time. After putting more planning into my mana base and deciding on a careful strategy for when to take a mulligan, I now rarely experience those issues. When I do, it is mainly because I break my own rules out of greed, refusing to admit that a hand with great cards is too low on mana.
I agree that Elo falls rather short for multiplayer games (the article's approach probably converges much more slowly, or fails to converge, compared with an approach built around supporting multiplayer contests, and the simplification for "board zaps" is likely just plain wrong, although that might be a limitation of how they recorded their games), but I don't think individual MTG games having a substantial amount of luck should really hurt the usefulness of Elo (or similar systems such as Glicko). After all, Elo is just trying to find ratings which best predict a given game outcome, so the presence of good/bad draws should still be well-modeled by that idea. In particular, for two given players (at a particular point in time and holding particular decks[0]), it stands to reason that you should still be able to find some pair of ratings Rx and Ry s.t. P(x beats y) = 1/(1 + 10^((Ry - Rx)/400)).
That being said, the inherent randomness of MTG maybe means that, in an ill-defined, abstract sense, it takes "more skill" to improve 100 Elo points in MTG than in chess, because X% of your games have no meaningful decisions, so you have fewer places to take advantage of your superior decision-making. That probably also has real implications for reasonable choices of K if you're running, say, MTG Arena, but the article is pretty clear that they're not doing anything especially rigorous when picking K in the first place, and honestly (IMO) it probably doesn't matter a whole lot if you're running a Friday night beer league with some friends or whatever.
[0] I agree with the sibling comment that deck selection and deckbuilding are a large part of what Magic players mean when they discuss skill, and it seems very reasonable to allow those things to be included in our model.
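For reference, that prediction and the standard incremental update are only a few lines; a minimal sketch (the K=32 default here is just a common choice, not the article's):

    def expected_score(r_x: float, r_y: float) -> float:
        """P(x beats y) under the Elo model."""
        return 1.0 / (1.0 + 10 ** ((r_y - r_x) / 400.0))

    def elo_update(r_x: float, r_y: float, score_x: float, k: float = 32.0):
        """Return new ratings after one game; score_x is 1, 0.5, or 0."""
        e_x = expected_score(r_x, r_y)
        return r_x + k * (score_x - e_x), r_y + k * ((1.0 - score_x) - (1.0 - e_x))

    # e.g. a 1600-rated player beating a 1500-rated player:
    print(elo_update(1600, 1500, 1.0))   # roughly (1611.5, 1488.5)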
Elo isn't misapplied here. It's just that when game results have a higher luck factor, you get a narrower distribution with shorter tails. You don't get those 2800 Elo players like in chess, who have virtually a 100% chance of beating nearly everyone every time. The best and worst players tend more toward the center, but there's still meaningfulness behind the score.
> Regardless of how good you are at MTG, and even how good your deck is, you're going to lose a lot of games due to mana flood / mana screw / etc.
What makes this different from blind build order choices in Starcraft? The greed > safe > rush > greed interactions often set one player ahead pretty arbitrarily in the very early game.
It's more like if mineral placement was randomized at the start of a game, with players having uneven access.
What you are describing with the build order exists in Magic as well. Many Magic decks have a single game plan ("rush", etc.) and can only minimally adapt between games in a match (by swapping cards with a 15-card sideboard). The degree to which a matchup is uneven can vary a lot, and some decks are hybridized, so it doesn't just devolve into rock/paper/scissors.
Given that the purpose of Elo's system is to predict the outcome of a game between two players who have had little or no prior interaction, it can be "misapplied" here to great effect. While unfairness and randomness (like starting as Black vs. White in chess) can bias that estimate and increase its variance, it is still better than tossing a coin.
So, Elo scores were actually used to track players until 5-10 years ago, and I think you can still view your score in your profile details in Magic Online.
Back then they were trying to identify the best Magic players in the world. I think it started at 1600.
In the mid-90s it was hard to get locally sanctioned games, so I manually tracked games in my college town and used it as my first blog ever, posting the scores of local players for a year or so. I wish the site were archived, but I never kept a copy when I went to work and abandoned the site. I remember going through the formula to calculate the score; I couldn't get it to work in Excel or JavaScript, so I manually calculated it out on paper for a few dozen games a week. I think back about how much time I could have saved if I had just had a little more programming skill.
I track ratings and player records for a local tabletop board game league, and the question of how to choose and implement a rating system ends up being pretty interesting, with a lot of literature to read if you start following citations.
Even if you have well-defined, sequential 2-player matches, where a widely-used model[0] exists in the literature, there are a wide variety of ways to estimate player ratings from game results, which all have their own assumptions and various tuning parameters.[1] If your domain also includes team-based or multiplayer matches (or some other weird feature that you want to account for), you then get to decide whether you want to try and hack together something using Elo because it'll be "close enough"[5], or whether you want to try and use (or build!) a more sophisticated system which captures those nuances, such as Microsoft's TrueSkill(TM) system[6].
[0] The so-called "Bradley-Terry model" https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model
[1] Beyond Elo, which is described in the article, you have everything from Glicko/Glicko-2[2], which still do incremental updates to participants after each match but try to track rating uncertainty and player growth in a more sophisticated way, to systems like Edo[3] and Whole-History Rating[4], which attempt to find the maximum-likelihood ratings for all players at all points in time simultaneously.
[2] https://en.wikipedia.org/wiki/Glicko_rating_system
[3] http://www.edochess.ca/
[4] https://www.remi-coulom.fr/WHR/WHR.pdf
[5] This is (obviously) the approach taken in the article, and IMO is probably the right answer unless you're a huge nerd who's interested in wasting a ton of time for not-much practical benefit.
[6] https://en.wikipedia.org/wiki/TrueSkill
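To make [0] concrete, here's a minimal sketch of fitting Bradley-Terry strengths by maximum likelihood with the classic minorization-maximization (Zermelo) iteration. The input format and the conversion onto a 1500-centered, Elo-like scale are my own choices for illustration:

    import math
    from collections import defaultdict

    def fit_bradley_terry(results, iters=200):
        """results: list of (winner, loser) pairs -> {player: strength gamma}.
        Note: the MLE only exists if every player has at least one win and one
        loss against the rest of the field; this sketch doesn't check that."""
        players = sorted({p for pair in results for p in pair})
        wins = defaultdict(float)          # W_i: total wins per player
        pair_games = defaultdict(float)    # n_ij: games per unordered pair
        for w, l in results:
            wins[w] += 1
            pair_games[frozenset((w, l))] += 1

        gamma = {p: 1.0 for p in players}
        for _ in range(iters):
            new_gamma = {}
            for i in players:
                denom = sum(n / (gamma[i] + gamma[j])
                            for pair, n in pair_games.items() if i in pair
                            for j in pair if j != i)
                new_gamma[i] = wins[i] / denom if denom else gamma[i]
            # ratings are only relative, so normalize the geometric mean to 1
            log_mean = sum(math.log(g) for g in new_gamma.values()) / len(new_gamma)
            gamma = {p: g / math.exp(log_mean) for p, g in new_gamma.items()}
        return gamma

    def to_elo_like_scale(gamma):
        """Map strengths onto the familiar 400 * log10 scale around 1500."""
        return {p: round(1500 + 400 * math.log10(g)) for p, g in gamma.items()}

    games = [("alice", "bob"), ("alice", "carol"), ("bob", "carol"), ("carol", "alice")]
    print(to_elo_like_scale(fit_bradley_terry(games)))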
If you're open to trying it out, I wrote an open-license alternative to TrueSkill: https://github.com/philihp/openskill.js, which has been ported to half a dozen languages. I'd love your feedback on it.
This looks sweet, I'll definitely play around with it a bit.
Interesting take on the multiplayer Elo scoring. Another approach I've seen is to essentially treat an n-player game as a set of (n choose 2) two-player games, so, in the case where A>B>C>D, we have A winning three games (vs. B, C, and D), B winning two games (plus her loss against A), C winning one game (plus losses against A and B), and D losing three games. The advantage of this over the model described in the article is that it more gracefully handles what the author called multi-player zaps. In that instance, where say D is eliminated first and B and C simultaneously, A still beats B, C, and D, but B and C are treated as a tie¹ and both beat D.
⸻
1. A tie is not a strictly neutral event in many Elo scoring systems: usually it means that the higher-ranked player loses some Elo while the lower-ranked player gains some, just not as much as in a straight victory.
For team-based play (like with Spades) on Board Game Arena, they treat the partners as having tied, which is, I think, incorrect. A better approach is probably to treat it as a match between two players where each team's Elo is the mean of its members' individual ratings. The tie approach means that a strong player is penalized for having a weaker partner.
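A sketch of that decomposition, the way I read it: simultaneously eliminated players are scored as a tie, expectations are taken from pre-game ratings so pair order doesn't matter, and the team-as-mean idea is included as a helper (the K=32 and 1500 baseline are just placeholders):

    from itertools import combinations

    def expected(r_a, r_b):
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

    def rate_multiplayer(ratings, placements, k=32):
        """placements: groups in finish order, best first,
        e.g. [["A"], ["B", "C"], ["D"]] for the zap example above.
        Expands one n-player game into (n choose 2) pairwise results;
        players in the same group are scored as a tie (0.5)."""
        pre = dict(ratings)  # snapshot so every pairing uses pre-game ratings
        def apply(a, b, score_a):
            e_a = expected(pre[a], pre[b])
            ratings[a] += k * (score_a - e_a)
            ratings[b] += k * ((1 - score_a) - (1 - e_a))
        for i, group in enumerate(placements):
            for a, b in combinations(group, 2):      # simultaneous finish -> tie
                apply(a, b, 0.5)
            for later in placements[i + 1:]:         # beat everyone who placed lower
                for a in group:
                    for b in later:
                        apply(a, b, 1.0)

    def team_rating(ratings, members):
        """One option for team play: rate the team as the mean of its members."""
        return sum(ratings[m] for m in members) / len(members)

    ratings = {"A": 1500, "B": 1500, "C": 1500, "D": 1500}
    rate_multiplayer(ratings, [["A"], ["B", "C"], ["D"]])
    print(ratings)   # A gains three wins' worth, B and C tie each other, D loses three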
That's an interesting approach, treating them as (n choose 2) sets of two-player games; I might have to try modeling that out and applying it to the same data to see how it changes the numbers.
I have a number of table zaps recorded in subjective data, and I've considered a similar approach: treating a game recorded as a table zap as A winning a game against B, C, and D. However, you're right, that doesn't handle the case of D being out first and then B and C being zapped at the same time. I think that's a subtle but important distinction.
Re: two-player matches, that's a much better approach than my naive interpretation. Definitely going to implement that sometime.
Is MtG one of the few games where being able to spend $10,000 USD puts you at a significant advantage (if playing Vintage)? I think Vintage is my favorite format, but it is becoming more expensive than high-end watch collecting.
I'd say the number of games like that is not few. 10k in equipment in most sports will give you an advantage over 1k in equipment.
It's more accurate to say that the entry cost into the Vintage format is 10k (or whatever the cost of the deck you want to play). You can't just throw money at the deck and increase its winrate unless your deck starts out suboptimal.
Almost nobody plays Vintage. Legacy is largely dead. The most expensive competitive format is Modern, where competitive decks can run in the low thousands.
1000 or 2000 in equipment is absolutely normal for a lot of competitive endeavors.
Jeff Lynne is getting into some esoteric things to make music for.
I too am old.
At one time I wanted to try to independently measure decks and players. I modeled much of the game and created a rudimentary AI to play the decks. My goal was to be able to compare decks to tell you which one was "better". As I went, I thought of cool ways to compare play strategies too. It was a really fun project, but in the end I succumbed: the model was getting more and more sophisticated, yet it was still far from complete. It's in my graveyard of cool projects that got to 50-70% complete.
> My goal was to be able to compare decks to tell you which one was "better".
This would be a really out-of-the-box way to compensate for the deck quality bias I allude to in my other post -- normalize the effect of the deck on game outcomes by using a static "deck quality" score.
I suspect that coming up with halfway decent "deck quality scores" is an extremely difficult problem, though. It's not much of a leap from there to imagine using a computer to solve for the best possible deck in the format, the implications of which are terrifying for competitive Magic (and would be priceless to card speculators).
I think it's not possible to normalize a single "deck quality" score, because the effectiveness of a deck depends on its opponents: you can have a deck that's good against some decks and weak against others in an intransitive manner, so deck quality is conditional on the frequency of other decks in the "competitor pool", i.e. the metagame. Game theory says that if there is no single dominant deck (and I don't think there would be in MtG), then there should be a Nash equilibrium of mixed strategies, e.g. I pull out deck A with x% probability and deck B with y% probability; with MtG's rules that likely involves a distribution over many decks with different counter-strategies and counter-counter-strategies.
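As a toy illustration of that equilibrium idea: take a completely made-up matrix of win probabilities for three archetypes with an intransitive aggro > combo > control > aggro cycle, and let fictitious play grind out the "which deck do I bring" mixed strategy (its empirical frequencies converge toward the equilibrium for this kind of symmetric, constant-sum game):

    # Hypothetical win-probability matrix: P[i][j] = P(deck i beats deck j).
    decks = ["aggro", "control", "combo"]
    P = [[0.50, 0.40, 0.60],   # aggro:   loses to control, beats combo
         [0.60, 0.50, 0.40],   # control: beats aggro, loses to combo
         [0.40, 0.60, 0.50]]   # combo:   loses to aggro, beats control

    counts = [1, 1, 1]                       # how often each deck has been "brought"
    for _ in range(100_000):                 # fictitious play
        total = sum(counts)
        mix = [c / total for c in counts]    # the field's empirical mixture
        # best response: the deck with the highest expected win rate vs. that field
        ev = [sum(P[i][j] * mix[j] for j in range(3)) for i in range(3)]
        counts[ev.index(max(ev))] += 1

    total = sum(counts)
    print({d: round(c / total, 3) for d, c in zip(decks, counts)})
    # With this perfectly symmetric cycle the equilibrium is 1/3 each; skew the
    # matchup percentages and the weights shift away from uniform accordingly.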
Deck quality scoring is a huge problem, but you're absolutely right, it quickly bubbles out into exponential problem spaces. For example, even for a given Commander deck selected for a given matchup of `x` players, that deck list could have changed in each of the last `n` games.
For this reason, the "assume a sphere with no friction" joke here is that deck selection, lock-in / mulligan processes, information asymmetry, and turn order are all assumed to be equal and at each player's local maximum.
I was starting out with baby steps related to how well balanced your mana was. I would calculate the likelihood of a particular permanent being cast on turn #0-n. Never got to the point of creating a single index to score a deck overall. I had a long way to go. But I imagined taking some clever machine learning algorithms to help find suggested cards and swapping those in to create suggested decks.
And I imagined this all as a service people would pay for, lol.
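That "cast by turn n" number is mostly a hypergeometric tail; here's a rough sketch, ignoring colors, ramp, and mulligans, and assuming a 60-card deck with 24 lands on the play (all of those are placeholder assumptions):

    from math import comb

    def p_at_least_k_lands(deck_size, lands, cards_seen, k):
        """P(at least k lands among cards_seen random cards): hypergeometric tail."""
        return sum(comb(lands, i) * comb(deck_size - lands, cards_seen - i)
                   for i in range(k, min(lands, cards_seen) + 1)) / comb(deck_size, cards_seen)

    def p_on_curve(cost, deck_size=60, lands=24):
        """Rough P(you can pay a colorless cost of `cost` on turn `cost`),
        on the play: 7-card opener plus (cost - 1) draws, needing `cost` lands."""
        return p_at_least_k_lands(deck_size, lands, 7 + cost - 1, cost)

    for cost in range(1, 7):
        print(f"turn {cost}: {p_on_curve(cost):.2f}")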
There's already an Elo project for sanctioned events at the Grand Prix level and above: http://www.mtgeloproject.net/
It uses the public pairings and results that were published each round for events all the way back to the 90s. Unfortunately, there are fewer competitive MTG events these days, so most people's ratings stop in early 2020, but that's another topic altogether.
The TrueSkill variant of the Elo algorithm has a publicly available Python module.
My understanding is that it handles new players and teams better than straight Elo.
My use for it was to keep track of winners during a Mario Kart tournament and see if it could predict the winners.
It did OK.
https://trueskill.org/
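For anyone who wants to try it, the basic usage looks roughly like this, if I remember the API correctly (worth double-checking against the docs at that link; the four-racer example is just a stand-in for a Mario Kart result):

    from trueskill import Rating, rate, rate_1vs1

    # 1v1: returns the two updated ratings, winner first
    alice, bob = Rating(), Rating()        # defaults to mu=25, sigma=25/3
    alice, bob = rate_1vs1(alice, bob)     # alice beat bob

    # Free-for-all, e.g. a 4-player race: one single-member "team" per player,
    # with ranks giving the finish order (lower is better)
    r1, r2, r3, r4 = (Rating() for _ in range(4))
    (r1,), (r2,), (r3,), (r4,) = rate([(r1,), (r2,), (r3,), (r4,)],
                                      ranks=[0, 1, 2, 3])

    print(alice, r1)   # each Rating carries a mean (mu) and an uncertainty (sigma)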
I'm assuming your Elo-calculation code is looking at one match at a time. If you want to level-up the mathematical precision, check out Bayeselo from Remi Coulom: https://www.remi-coulom.fr/Bayesian-Elo/
> If you want to level-up the mathematical precision
I absolutely do! Thanks for the link, I’ll definitely check it out.
It is an implementation of the Bradley-Terry model. There are a few other implementations mentioned elsewhere in the comments.
This project should consider using data gathered at 17lands to check how well it scores.
Anyone else read the title as ELO scoring (as in, Electric Light Orchestra is scoring)?
Came to say that MTG Arena is pretty cool for an old MTG player like me.