Comment by GregorStocks

9 days ago

You wouldn't really need a _ton_ of games to get plausible data, but unfortunately today each game costs real money - typically a dollar or more with my current harness, though I'm hoping to optimize it and of course I expect model costs to continue to decline over time. But even reasonably expensive models today are making tons of blunders that a tournament grinder wouldn't make.

I'm not trying to compute a chess-style "player X was at 0.4 before this move and at 0.2 afterwards, so it was a -0.2 blunder", but I do have "blunder analysis" where I just ask Opus to second-guess every decision after the game is over - there's a bit more information on the Methodology page. So then you can compare models by looking at how often they blunder, rather than the binary win/loss data. If you look at individual games you can jump to the "blunders" on the timeline - most of the time I agree with Opus's analysis.

Very cool project. I would like to caution against confidence in the claim that a ton of games wouldn't be necessary for plausible data. I'm also not convinced that anyone but a human expert in the particular matchup is really in an appropriate epistemic position to say much in sufficiently complex Magic formats. Game wins are probably a better indicator on average.
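
To put rough numbers on the sample-size concern, here is a minimal sketch that computes a Wilson score interval for a win rate at a few sample sizes. The 55% observed win rate and the game counts are made-up illustrations, not data from the project, and `wilson_interval` is a hypothetical helper name. It also treats games as independent coin flips, which ignores matchup, deck, and play/draw effects.

```python
import math


def wilson_interval(wins, games, z=1.96):
    """Wilson score interval for a win rate (roughly 95% when z = 1.96)."""
    if games <= 0:
        raise ValueError("games must be positive")
    p = wins / games
    denom = 1 + z * z / games
    center = (p + z * z / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z * z / (4 * games * games)) / denom
    return center - half, center + half


# Hypothetical scenario: a model that wins 55% of its games.
for games in (20, 100, 500):
    wins = round(0.55 * games)
    lo, hi = wilson_interval(wins, games)
    print(f"{games:>4} games, {wins} wins: {lo:.3f} to {hi:.3f}")
```

Under these assumptions, the intervals at 20 and 100 games still include 50%, so a modest edge like this would not be distinguishable from a coin flip until the sample gets into the hundreds of games.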