Comment by ted_dunning
5 days ago
No. This isn't just bad math.
The problem here is that the weighting of the alternatives changes over time, and the thing you are measuring may also change. If you start by measuring the better option, but then bring in the worse option during a generally better climate, you could easily conclude that the worse option is better.
To give a concrete example, suppose you have two versions of your website, one in English and one in Japanese. Worldwide, Japanese speakers tend to be awake at different hours than English speakers. If you don't run your tests over full days, you may bias the results toward one audience or the other. Even worse, weekend visitors may be very different from weekday visitors, so you may need to slow down to full weeks for your tests.
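A toy illustration of that bias, with made-up visitor counts (the versions, periods, and numbers below are hypothetical, not from any real test): version A is genuinely better in every time period, but because A was mostly measured off-peak and B mostly at peak, the pooled rates tell the opposite story.

```python
# (visitors, conversions) per (version, period) -- made-up numbers
data = {
    ("A", "peak"):     (100, 12),   # A converts 12% at peak
    ("A", "off_peak"): (900, 45),   # A converts 5% off-peak
    ("B", "peak"):     (900, 90),   # B converts 10% at peak
    ("B", "off_peak"): (100, 4),    # B converts 4% off-peak
}

def rate(version, period=None):
    """Conversion rate for a version, optionally restricted to one period."""
    items = [v for k, v in data.items()
             if k[0] == version and (period is None or k[1] == period)]
    visits = sum(v for v, _ in items)
    convs = sum(c for _, c in items)
    return convs / visits

# A beats B in every individual period...
assert rate("A", "peak") > rate("B", "peak")          # 0.12 > 0.10
assert rate("A", "off_peak") > rate("B", "off_peak")  # 0.05 > 0.04
# ...but pooled over unequal time windows, B looks better.
print(rate("A"), rate("B"))  # 0.057 vs 0.094
```

This is just Simpson's paradox driven by when each version happened to be measured, which is exactly why testing over full days (or weeks) matters.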
Running tests slowly may mean that you can only run a few of them, unless you are looking for large effects that will show through the confounding factors.
And that leads back to the most prominent everyday use: progressive deployments. The goal there is to test whether the new version is catastrophically worse than the old one, so as soon as you have error bars that bound the new version's performance away from catastrophe, you are good to go.
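A minimal sketch of that kind of rollout gate, assuming a Wilson score upper bound on the new version's error rate and a made-up "catastrophe" threshold (5x the old error rate); the names and numbers are placeholders, not any particular system's API:

```python
import math

def wilson_upper(errors, n, z=2.0):
    """Upper Wilson confidence bound on an error rate (~97.7% one-sided at z=2)."""
    if n == 0:
        return 1.0
    p = errors / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre + margin) / denom

OLD_ERROR_RATE = 0.002
CATASTROPHE = 5 * OLD_ERROR_RATE  # hypothetical threshold: 1%

def safe_to_promote(errors, n):
    """Promote once the error bar bounds performance away from catastrophe."""
    return wilson_upper(errors, n) < CATASTROPHE

print(safe_to_promote(1, 200))    # -> False: too little traffic, bound still wide
print(safe_to_promote(2, 5000))   # -> True: enough traffic to rule out catastrophe
```

The point is that you don't need to prove the new version is *better*, only that its upper error bar sits below the disaster line, which takes far less traffic.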
I mean, sure, you could test over only part of the day, but if you do, that is, imho, bad math.
E.g. I could sum up 10 (decimal) and 010 (octal) and get 20, but because the same digits mean different values in different numbering systems, you need to normalize the values to the same base first: 010 octal is 8 decimal, so the real sum is 18.
Or I could add up 5 GBP, 5 USD, 5 EUR and 5 JPY and claim I got 20 units of "currency", but that total doesn't really mean anything.
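Both analogies in code, for concreteness: normalize to a common representation before summing, or the total is meaningless. (The exchange rates below are made-up placeholders, not real rates.)

```python
# "10" decimal plus "010" octal is 18, not 20:
total = int("10", 10) + int("010", 8)   # 10 + 8
print(total)  # -> 18

# Same with currencies: convert to one unit first.
usd_per = {"GBP": 1.25, "USD": 1.0, "EUR": 1.10, "JPY": 0.007}  # fake rates
amounts = [(5, "GBP"), (5, "USD"), (5, "EUR"), (5, "JPY")]

naive = sum(a for a, _ in amounts)                # 20 of "currency" -- meaningless
in_usd = sum(a * usd_per[c] for a, c in amounts)  # a real quantity
print(naive, round(in_usd, 3))  # -> 20 16.785
```

The naive sum is well-defined arithmetic on numbers that simply don't share a unit, which is the whole objection.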
Otherwise, we are comparing incomparable values, and that's bad math.
Sure, percentages are something everybody gets wrong (hey, percentage points vs. percent), but that doesn't make them any less wrong. And knowing which values are actually comparable when you talk purely in percentages is even harder (as per your examples).