Comment by btilly

13 years ago

In order to do anything in practice, you have to make assumptions. With A/B testing your key assumption is that differences in version behavior are relatively stable over time, even if absolute behavior changes. This is not always true - for instance a version that makes references to Christmas will perform differently before and after Dec 25. But it tends to be true often enough that people can rely on that assumption. (With no simplifying assumptions of any kind you get analysis paralysis - that would be bad.)

Now to reply point by point.

1. Real world performance varies over time...

With the assumption that I provided above, A/B testing merely needs to collect enough data to detect a real difference in user behavior. After a few days you may not know what the average long-term rate is, but you can have strong evidence that one version is better.
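
For concreteness, here is a minimal sketch (mine, not anything from the original discussion) of what "enough data to detect real differences" looks like in practice: a plain two-proportion z-test on made-up conversion counts from a few days of traffic.

```python
from statistics import NormalDist
from math import sqrt

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the conversion counts of two versions."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Made-up counts after a few days: 5.0% vs 6.0% conversion on 10,000 visitors each.
z, p = two_proportion_z(conv_a=500, n_a=10_000, conv_b=600, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")  # roughly z = 3.1, p = 0.002: B looks better
```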

I have absolutely no idea where you got the impression that A/B tests typically run for months, or that you can't touch anything else while one is running. How long they need to run varies wildly (it depends on traffic volume, conversion rates, etc.), but for many organizations a week is a fairly long test.
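
To make "varies wildly" concrete, here is a rough back-of-the-envelope calculation using the standard two-proportion sample-size formula. The baseline rate, lift, and traffic numbers are invented for illustration.

```python
from statistics import NormalDist
from math import ceil

def visitors_per_arm(p_base, rel_lift, alpha=0.05, power=0.8):
    """Approximate visitors needed per version to detect a given relative lift."""
    p_var = p_base * (1 + rel_lift)
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_b = NormalDist().inv_cdf(power)           # desired power
    var = p_base * (1 - p_base) + p_var * (1 - p_var)
    return ceil((z_a + z_b) ** 2 * var / (p_var - p_base) ** 2)

n = visitors_per_arm(p_base=0.05, rel_lift=0.10)  # 5% baseline, +10% relative lift
for daily_traffic in (2_000, 50_000):             # invented traffic levels
    days = 2 * n / daily_traffic                  # both arms share the traffic
    print(f"{n} visitors per arm -> about {days:.1f} days at {daily_traffic}/day")
```

With these made-up numbers the same test takes roughly a month on a small site and about a day on a high-traffic one, which is the whole point: test length is a property of your traffic and conversion rates, not of the method.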

2. (This is really a special case of #1 - but a very, very important special case...

Contrary to your point, A/B tests DO NOT assume that the distribution is stationary. They make the much weaker assumption that if a relative difference is found, that difference will continue to be true later. (If this assumption is only true 95% of the time, it is still useful...)

When you're running multiple tests, the time-dependence of performance is often extreme. For instance, while running test 1 you introduce test 2, it wins, and you adopt it. Now the time period during which you were running test 1 has a couple of conversion bumps of around 5% in the middle. This is not a problem for properly constructed A/B tests. But it is a serious violation of the stationarity assumptions that bandit algorithms rely on. (Myna handles non-stationary situations, so long as they change quite slowly.)
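
To illustrate why such a bump doesn't bias a properly randomized test, here is a toy simulation of my own (the traffic and conversion numbers are invented): halfway through, the site-wide rate jumps by about 5% relative, but both arms see the jump equally, so version B's relative advantage is still visible.

```python
import random

random.seed(0)
results = {"A": [0, 0], "B": [0, 0]}          # arm -> [conversions, visitors]

for day in range(14):
    bump = 1.05 if day >= 7 else 1.0          # test 2's winner ships on day 7
    for _ in range(5_000):                    # invented daily traffic
        arm = random.choice(["A", "B"])       # random, time-independent assignment
        rate = (0.050 if arm == "A" else 0.055) * bump   # B ~10% better throughout
        results[arm][1] += 1
        results[arm][0] += random.random() < rate

for arm, (conv, n) in results.items():
    print(f"{arm}: {conv / n:.4f} conversion over {n} visitors")
# Both rates are inflated by the mid-test bump, but B still comes out ahead.
```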

Now what is a properly constructed A/B test? A properly constructed A/B test is one where membership in any particular test version has only random correlations with any other factor that could materially affect performance. Such factors include time and other running tests. If each user is randomly placed in a test version when you first encounter them, independently of anything else that happens, then you have a well-constructed A/B test. Your versions will be statistically apples/apples even though early and late users are apples/oranges.
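
One common way to get that kind of assignment (this is my own sketch, not a description of any particular tool) is to hash the user id together with a per-experiment salt. Placement is then stable for each user, yet uncorrelated with time and with every other experiment.

```python
import hashlib

def assign(user_id: str, experiment: str, variants=("A", "B")) -> str:
    """Deterministically map a user to a variant for a given experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# The same user gets independent, stable buckets in different experiments
# (ids and experiment names here are hypothetical).
print(assign("user-12345", "checkout-button-test"))
print(assign("user-12345", "homepage-headline-test"))
```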

3. Code over time gets messy...

The degree to which this is seen as an issue depends on personal taste, and on whether you're the person who has to dive through templates with a maze of if conditions. If you talk to management, it is almost never a (visible) concern. Programmers on the ground often disagree.

4. Businesses are complex, and often have multiple measures they would like to balance...

Getting the business to have the necessary discussions in the abstract is a non-starter in my experience. When you have a concrete example, decisions are easier to make.

Also (as happened in the test that I saw last week), often the discrepancy is a sign of another problem. A better reward criterion would have resulted in making a good pick, but looking into the discrepancy uncovered an issue that wouldn't have surfaced otherwise, and we'll hopefully wind up with an even better option than we would have arrived at directly.

5. Many tests perform differently on existing users and new users...

My point is that the extra analyses are useful. However, this is really a secondary or tertiary point, since most companies doing A/B testing are not going this extra mile.

As for AdWords, Google has sufficient volume for a traditional A/B test to converge in minutes. They have so much volume that continuous adaptation can just happen for them. Most businesses are not in such a fortunate place.