Comment by noelwelsh
13 years ago
So Ben, looks like it's time to cross swords again :)
1. Real world performance varies over time...
I don't really understand this complaint. With A/B testing, you have to collect enough data before making a decision that the conversion rate fluctuations average out. If you're worried about monthly fluctuations (which is reasonable), and you're doing A/B testing, you need to collect data for a few months. This means months where you aren't optimising. With a bandit algorithm conversion rate fluctuations will cause changes in the bandit behaviour, but they will average out over time just like A/B testing. We can prove this, so I don't know what justification there is for the statement that it is "a big issue for this approach". Indeed I think this points out a weakness of A/B approaches. Firstly, it's arguable that tracking fluctuations in conversion rate is a good thing. They may be temporary, but why shouldn't the algorithm react to temporary changes? Secondly, unlike A/B testing you're not spending months collecting data -- you're actually optimising while you collect data. Finally, what site can afford to spend months not changing, while data is collected?
2. (This is really a special case of #1 - but a very, very important special case...
I see two issues here: 1) the fact the traffic has changed and 2) how long does it take the system to recognise this change?
Addressing 1: Getting technical for a moment, A/B tests assume the distribution generating the data is stationary. This situation violates this assumption, so one shouldn't be using A/B testing here. I'd like to know what is meant by a properly constructed A/B test in this case. Does it mean throwing out all the data before the significant improvement? If you say that a non-stationary distribution is fine, then you can develop a bandit algorithms for non-stationary situations (e.g. http://arxiv.org/abs/0805.3415). Myna handles non-stationary situations, so long as they change quite slowly.
Point 2 is really about convergence rates. There is some work here (http://imagine.enpc.fr/~audibert/Mes%20articles/ALT11.pdf) and the upshot is that convergence rate can be a problem in theory though you can design for it. Notably we haven't observed this issue in practice.
3. Code over time gets messy...
This is a non-issue in my experience. At some point you decide that an experiment is no longer worthwhile. Either you're redesigning things so it doesn't make sense anymore, or the experiment has been running long enough that you are confident of no more changes. When you hit this point you remove the code. It can be a year or a hour after starting the experiment. Either way, you know with a bandit algorithm you will have optimised what traffic you did receive.
4. Businesses are complex, and often have multiple measures they would like to balance...
The glib answer to this is to set your reward criteria correctly, but more below.
5. Many tests perform differently on existing users and new users...
Both of these are really about performing extra analysis on the data. Myna doesn't help here, and we could make it better in this regard by, for example, exporting data. Some of this extra analysis, such as cohort analysis, we can and will automate in the future. There will always be analyses that we can't or don't perform, so there will always be jobs for people who perform these analyses :-) On the other hand, bandit algorithms are used enough in practice (e.g. Google uses them for AdWords) that their utility has been validated. It's a trade-off between time and return. We want to automate the common cases so the analysts can concentrate on the interesting parts.
Do you attempt to control for seasonal effects at all in your models? I'm wondering if allowing for a dependence on day-of-week, time-of-day and other relevant covariates, has any benefits over tracking fluctuations in a generic way?
In order to do anything in practice, you have to make assumptions. With A/B testing your key assumption is that differences in version behavior are relatively stable over time, even if absolute behavior changes. This is not always true - for instance a version that makes references to Christmas will perform differently before and after Dec 25. But it tends to be true often enough that people can rely on that assumption. (With no simplifying assumptions of any kind you get analysis paralysis - that would be bad.)
Now to reply point by point.
1. Real world performance varies over time...
With the assumption that I provided above, A/B testing merely needs to collect enough data to detect real differences in user behavior. After a few days you may not know what the average long-term rate is, but you have strong evidence that one is better.
I have absolutely no idea where you got the impression that A/B tests typically run for months, or that you can't touch anything else while it is running. How long they need to run varies wildly (depends on traffic volume, conversion rates, etc), but for many organizations a week is a fairly long test.
2. (This is really a special case of #1 - but a very, very important special case...
Contrary to your point, A/B tests DO NOT assume that the distribution is stationary. They make the much weaker assumption that if a relative difference is found, that difference will continue to be true later. (If this assumption is only true 95% of the time, it is still useful...)
When you're running multiple tests the time-dependence of performance is often extreme. For instance while running test 1, you introduce test 2, it wins, and you adopt it. Now the time period you're running test 1 has 2 conversion bumps of around 5% in the middle. This is not a problem for properly constructed A/B tests. But it is a serious violation of, Myna handles non-stationary situations, so long as they change quite slowly.
Now what is a properly constructed A/B test? A properly constructed A/B test is one where membership in any particular test version has only random correlations with any other factor that could materially affect performance. Such factors include time, and other running tests. If each user is randomly placed in a test version upon first encountering them, independently of anything else that happens, then you have a well-constructed A/B test. Your versions will be statistically apples/apples even though early and late users are apples/oranges.
3. Code over time gets messy...
The degree to which this will be seen to be an issue depends on personal taste, and how much you're the person who has to dive through templates with a maze of if conditions. If you talk to management, it is almost never a (visible) condition. Programmers on the ground often disagree.
4. Businesses are complex, and often have multiple measures they would like to balance...
Getting the business to have the necessary discussions in an abstract volume is a non-starter in my experience. When you have a concrete example, decisions are easier to make.
Also (as happened in the test that I saw last week) often the discrepancy is a sign to another problem. A better reward criteria would have resulted in making a good pick, but looking at the discrepancy found an issue that wouldn't have been found otherwise, and we'll hopefully wind up with an even better option that we would not have found correctly.
5. Many tests perform differently on existing users and new users...
My point is that the extra analyses are useful. However this is really a secondary or tertiary point since most companies doing A/B testing are not going this extra mile.
As for AdWords, Google has sufficient volume for a traditional A/B test to converge in minutes. They have so much volume that continuous adaptation can just happen for them. Most businesses are not in such a fortunate place.