Comment by btilly
13 years ago
This is thought-provoking, which is good. However there are significant issues with the approach.
1. Real world performance varies over time. For instance there are typically daily, weekly and monthly conversion rate fluctuations. Not an issue for A/B testing, but a big issue for this approach if a random switch in direction happens at the same time that conversion fluctuations happen to head in a good direction.
2. (This is really a special case of #1 - but a very, very important special case.) This approach creates long-lasting interaction effects between tests and independent changes. That requires explanation. Suppose you're running a test. Version A is somewhat better. But version B is temporarily looking slightly better when you make a significant improvement to your website (maybe you started another test that works). Now you're adding a lot of good traffic to version B (good because of the other change) and very little of the new good traffic to version A. This new version B traffic soundly beats your old version A traffic. This correlation between time and website performance will continue until the old version A traffic is completely swamped by new version A traffic. With only 5% of your traffic going to version A, this can easily take 100x as long as your test has been running - or more. (Properly constructed A/B tests do not suffer this statistical anomaly.)
3. Code over time gets messy. One of the most important characteristics of A/B testing is that you can delete the mess and move on. With this approach you can't - it just hangs around adding to your technical debt.
4. Businesses are complex, and often have multiple measures they would like to balance. For instance in a recent test, conversion to click was hurt, conversion to a person who clicked 5x was helped. A/B testing let us notice that something weird was going on and think about what we really cared about. This automated approach would make a decision and could have hidden a real problem.
5. Many tests perform differently on existing users and new users. A/B testing with proper cohort analysis can let you tease this out and decide accordingly. This approach doesn't give you that kind of sophistication.
These also don't make much sense to me, but to avoid getting d/ved to oblivion I'll say why.
Firstly both you and the author aren't really quoting any maths or real world papers or anything. He's backing his up with saying that all the advertisers are using this over A/B though, which is a pretty strong argument. But it occurs to me that for most of your points to stand you need to tackle this particular paragraph:
Like many techniques in machine learning, the simplest strategy is hard to beat. More complicated techniques are worth considering, but they may eke out only a few hundredths of a percentage point of performance. The strategy that has been shown to win out time after time in practical problems is the epsilon-greedy method.
So to tackle you points:
1. Only stands if you show his paragraph above to be wrong. Does epsilon-greedy only work on consistent payouts, or does it work on fluctuating payouts too? It would seem to me that this would be a common occurrence in advertising on websites. I imagine there is some research out there to settle this!
2. He addresses this directly in the post: This won't adapt to change. (Your visitors probably don't change. But if you really want to, in the reward function, multiply the old reward value by a forgetting factor)
3. There is no difference between this and A/B testing, the mock code he shows is supposed to go in your A/B testing framework, the code in the controllers is supposed to be the same (and you can remove it the same way).
4. Isn't A/B testing just as bad at testing multiple factors? Why wouldn't you 'notice', you should theoretically see the same percentages for each stage. And would be able to notice the oddity.
5. Again this only stands if you show his paragraph above to be wrong. You are suggesting that a complicated strategy will win, which he says isn't true.
I showed that paragraph to be wrong. Not slightly, but wildly in the common scenario where, while a test is running, you make another change to your website that lifts conversion.
The difference in this case is not a few hundredths of a point of convergence. It is a question of potentially drawing wrong conclusions about the wrong version for 100x as long as you need to.
What your forgetting is it's adaptive. That 10% random factor means it's constantly adding in new information. Also, you can graph trends over time so if you make a significant change you could reset the historical data to zero, but simply letting it run and it will adapt to the change.
If your really concerned about rapidly changing events just add a diminishing return. AKA multiply both the success and failure number by say .9999 after each test. so 34/2760 = 34.9966/2760.724 on success or 33.9966/2760.724 on a failure.
2 replies →
I got a very different impression of this method than you did because I saw it as, just as with an A/B test, when you make a major change to the site, you reset your counters. This makes your point #2 mute. As for number 5, why couldn't you use the same cohort analysis with this method?
What is a major change?
I've seen minor tweaks to a form raise conversions by 20%. The person running a particular test may not even know that a particular change was significant, or even that it happened. The change could be as subtle as another running test realized that version x is better, interfering with existing tests.
As for #5, with an A/B test you run into these situations, you're able to break down and crunch the numbers in multiple ways, and then have a discussion about how you want to proceed. But with a multi-armed bandit approach whatever complexities you have not thought of and baked into your approach, are not going to be noticed.
moot.
mute == "does not talk"
moot == "of little or no practical value or meaning; purely academic."
moot/mute is a tricky word combo because smart people can justify using "MUTE" instead of moot, and english is such a terrible language w.r.t. spelling there are no good clues to use one versus the other (and "moot" is far less commonly used).
Off-topic: Most of the time, such grammar/spelling corrections get downvoted because people think they are nitpicking and don't add anything to the conversation, or are rude somehow, but as a non-English, I appreciate them a lot!
It seems to me that A/B (or A/B/C/...) testing as described by btilly, and epsilon-greedy multi-armed bandit optimisation as described by Steve Hanov are two points on a continuum. In A/B/... testing, you're exploring 100% of your traffic, and eventually you declare the test "done", and start exploiting 100% of your traffic (sending it to the "best" slice). You have the advantage that during the exploration phase, your assignment of users to test slices is completely random, uncorrelated with anything else. In the epsilon-greedy approach, you're exploring with 10% of your traffic, and exploiting the current best-looking choice with 90% of your traffic. The trouble is that now 90% of your data is coming from one slice, and which slice it's coming from could vary over time. This means, as Ben points out, that your choice of slice could be 90% correlated with anything else that varies over time.
One approach would be to ignore the data from the 90% exploitation; that way, you only get 10% of the data, but its slice assignment is completely random and uncorrelated with anything else that might be happening. The trouble is that now you're running an A/B/... test on only 10% of your traffic, which means that it will converge 10x slower than if you were running it on 100% of your traffic.
However, it seems to me that the extra 90% of data that I've proposed ignoring isn't that useful, because it's only coming from one slice at a time. What you really want is to get more data from the slices you know least about. I suspect there are reinforcement learning algorithms that take into account not just the reward rate for each slice, but the current level of certainty with which the algorithm knows the reward rate, so it can collect more data about the slices it knows the least about, and stop collecting data about the slices for which it already has a fairly accurate reward estimate. The question is, are there such algorithms that can also handle non-stationary reward distributions? And how much tuning and tweaking do they require?
That data from the 90% on slice X is valuable because you're trying to confirm, or falsify, that X is better, quickly. The strategy is "look closely at this pointy thing to see if it's a needle or a sharp piece of straw."
But... "better" than the other choices, which are only getting 1/10th the traffic. The amount of traffic sent to a choice should be a function of how good you currently think it is, and how much more data you need to be sufficiently certain about that choice. So choices that have insufficient data should get more traffic, and choices that already have sufficient data to be sure they're worse than some other choice should get very little traffic. How much traffic they should get depends on how certain you are of stationarity (is that a word?).
re: #3... you can also remove items at will. If you look at the counters and find option C is consistently the lowest performer, then you can remove it from the code and remove the counter. It doesn't mean cruft needs to pile up.
But how do you know when you can do that? Particularly with the slower convergence issue, and the potential presence of confounding factors like #2?
So Ben, looks like it's time to cross swords again :)
1. Real world performance varies over time...
I don't really understand this complaint. With A/B testing, you have to collect enough data before making a decision that the conversion rate fluctuations average out. If you're worried about monthly fluctuations (which is reasonable), and you're doing A/B testing, you need to collect data for a few months. This means months where you aren't optimising. With a bandit algorithm conversion rate fluctuations will cause changes in the bandit behaviour, but they will average out over time just like A/B testing. We can prove this, so I don't know what justification there is for the statement that it is "a big issue for this approach". Indeed I think this points out a weakness of A/B approaches. Firstly, it's arguable that tracking fluctuations in conversion rate is a good thing. They may be temporary, but why shouldn't the algorithm react to temporary changes? Secondly, unlike A/B testing you're not spending months collecting data -- you're actually optimising while you collect data. Finally, what site can afford to spend months not changing, while data is collected?
2. (This is really a special case of #1 - but a very, very important special case...
I see two issues here: 1) the fact the traffic has changed and 2) how long does it take the system to recognise this change?
Addressing 1: Getting technical for a moment, A/B tests assume the distribution generating the data is stationary. This situation violates this assumption, so one shouldn't be using A/B testing here. I'd like to know what is meant by a properly constructed A/B test in this case. Does it mean throwing out all the data before the significant improvement? If you say that a non-stationary distribution is fine, then you can develop a bandit algorithms for non-stationary situations (e.g. http://arxiv.org/abs/0805.3415). Myna handles non-stationary situations, so long as they change quite slowly.
Point 2 is really about convergence rates. There is some work here (http://imagine.enpc.fr/~audibert/Mes%20articles/ALT11.pdf) and the upshot is that convergence rate can be a problem in theory though you can design for it. Notably we haven't observed this issue in practice.
3. Code over time gets messy...
This is a non-issue in my experience. At some point you decide that an experiment is no longer worthwhile. Either you're redesigning things so it doesn't make sense anymore, or the experiment has been running long enough that you are confident of no more changes. When you hit this point you remove the code. It can be a year or a hour after starting the experiment. Either way, you know with a bandit algorithm you will have optimised what traffic you did receive.
4. Businesses are complex, and often have multiple measures they would like to balance...
The glib answer to this is to set your reward criteria correctly, but more below.
5. Many tests perform differently on existing users and new users...
Both of these are really about performing extra analysis on the data. Myna doesn't help here, and we could make it better in this regard by, for example, exporting data. Some of this extra analysis, such as cohort analysis, we can and will automate in the future. There will always be analyses that we can't or don't perform, so there will always be jobs for people who perform these analyses :-) On the other hand, bandit algorithms are used enough in practice (e.g. Google uses them for AdWords) that their utility has been validated. It's a trade-off between time and return. We want to automate the common cases so the analysts can concentrate on the interesting parts.
Do you attempt to control for seasonal effects at all in your models? I'm wondering if allowing for a dependence on day-of-week, time-of-day and other relevant covariates, has any benefits over tracking fluctuations in a generic way?
In order to do anything in practice, you have to make assumptions. With A/B testing your key assumption is that differences in version behavior are relatively stable over time, even if absolute behavior changes. This is not always true - for instance a version that makes references to Christmas will perform differently before and after Dec 25. But it tends to be true often enough that people can rely on that assumption. (With no simplifying assumptions of any kind you get analysis paralysis - that would be bad.)
Now to reply point by point.
1. Real world performance varies over time...
With the assumption that I provided above, A/B testing merely needs to collect enough data to detect real differences in user behavior. After a few days you may not know what the average long-term rate is, but you have strong evidence that one is better.
I have absolutely no idea where you got the impression that A/B tests typically run for months, or that you can't touch anything else while it is running. How long they need to run varies wildly (depends on traffic volume, conversion rates, etc), but for many organizations a week is a fairly long test.
2. (This is really a special case of #1 - but a very, very important special case...
Contrary to your point, A/B tests DO NOT assume that the distribution is stationary. They make the much weaker assumption that if a relative difference is found, that difference will continue to be true later. (If this assumption is only true 95% of the time, it is still useful...)
When you're running multiple tests the time-dependence of performance is often extreme. For instance while running test 1, you introduce test 2, it wins, and you adopt it. Now the time period you're running test 1 has 2 conversion bumps of around 5% in the middle. This is not a problem for properly constructed A/B tests. But it is a serious violation of, Myna handles non-stationary situations, so long as they change quite slowly.
Now what is a properly constructed A/B test? A properly constructed A/B test is one where membership in any particular test version has only random correlations with any other factor that could materially affect performance. Such factors include time, and other running tests. If each user is randomly placed in a test version upon first encountering them, independently of anything else that happens, then you have a well-constructed A/B test. Your versions will be statistically apples/apples even though early and late users are apples/oranges.
3. Code over time gets messy...
The degree to which this will be seen to be an issue depends on personal taste, and how much you're the person who has to dive through templates with a maze of if conditions. If you talk to management, it is almost never a (visible) condition. Programmers on the ground often disagree.
4. Businesses are complex, and often have multiple measures they would like to balance...
Getting the business to have the necessary discussions in an abstract volume is a non-starter in my experience. When you have a concrete example, decisions are easier to make.
Also (as happened in the test that I saw last week) often the discrepancy is a sign to another problem. A better reward criteria would have resulted in making a good pick, but looking at the discrepancy found an issue that wouldn't have been found otherwise, and we'll hopefully wind up with an even better option that we would not have found correctly.
5. Many tests perform differently on existing users and new users...
My point is that the extra analyses are useful. However this is really a secondary or tertiary point since most companies doing A/B testing are not going this extra mile.
As for AdWords, Google has sufficient volume for a traditional A/B test to converge in minutes. They have so much volume that continuous adaptation can just happen for them. Most businesses are not in such a fortunate place.
1 is definitely an issue, but has a solution. I posted an article about this a while back with a simple example of the problem and a possible solution. The example is a staged rollout but it illustrates the same point. If you change proportions of tests over time as there are independent changes, you can get skewed results: http://blog.avidlifemedia.com/2011/12/23/advanced-ab-testing...
2. If you change the thing you're testing, you need to restart the test. Otherwise, you'll have a meaningless jumble of results from two separate things.
re #1,
could this be solved with an exponential decay?
aka, saying that a click from 1 month ago is less valuable than someone clicking today.
By tweaking the decay you could change how quickly the algorithm will sway when the conversion rate changes.
EDIT: I just saw that rauljara suggested this below: https://news.ycombinator.com/item?id=4040230
I don't think any of those points are very true.
I've been involved with A/B testing for nearly a decade. I assure you that none of these points are in the slightest bit hypothetical.
1. Every kind of lead gen that I have been involved with and thought to measure has large periodic fluctuations in user behavior. Measure it, people behave differently on Friday night and Monday morning.
2. If you're regularly running multiple tests at once, this should be a potential issue fairly frequently.
3. If you really fire and forget, then crud will accumulate. To get rid of that you have to do the same kind of manual evaluation that was supposed to be the downside of A/B testing.
4. Most people do not track multiple metrics on every A/B test. If so, you'll never see how it matters. I make that a standard practice, and regularly see it. (Most recently, last week. I am not at liberty to discuss details.)
5. I first noticed this with email tests. When you change the subject line, you give an artificial boost to existing users who are curious what this new email is. New users do not see the subject line as a change. This boost can easily last long enough for an A/B test to reach significance. I've seen enough bad changes look good because of this effect that I routinely look at cohort analysis.
What do you think of Myna, in these respects? Does it suffer from the same disadvantages as other bandit optimization approaches?
http://mynaweb.com/docs/
1 reply →
#1 certainly is, particularly for businesses prone to seasonal fluctuations. In local lead-gen (for instance) you see big changes in conversion based on the time of year.
Wow, you convinced me...
Sarcasm aside, I've also experienced all of these issues with real world testing and would be interested in hearing your argument as to why you think this is not the case.
Sorry, was on an iPad.
Most or all of the points suffer from: * is that actually true? * does regular a/b testing not also face that issue? * was it suggested that you must "set it and forget it"? * are there no mechanisms for mitigating these issues? * would using 20% or 30% mitigate the issues? * are you not allowed to follow the data closely with the bandit approach?
The whole list struck me as a supposed expert in the status quo pooh-poohing an easier approach.
1 reply →