I've been involved with A/B testing for nearly a decade. I assure you that none of these points are in the slightest bit hypothetical.
1. Every kind of lead gen that I have been involved with and thought to measure has large periodic fluctuations in user behavior. Measure it: people behave differently on Friday night and on Monday morning.
2. If you're regularly running multiple tests at once, this should be a potential issue fairly frequently.
3. If you really fire and forget, then crud will accumulate. To get rid of that you have to do the same kind of manual evaluation that was supposed to be the downside of A/B testing.
4. Most people do not track multiple metrics on every A/B test. If you don't, you'll never see how it matters. I make that a standard practice, and regularly see it matter. (Most recently, last week. I am not at liberty to discuss details.)
5. I first noticed this with email tests. When you change the subject line, you give an artificial boost to existing users who are curious what this new email is. New users do not see the subject line as a change. This boost can easily last long enough for an A/B test to reach significance. I've seen enough bad changes look good because of this effect that I routinely look at cohort analysis.
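To make the cohort analysis in point 5 concrete, here is a minimal, hypothetical sketch (the field names and the test start date are invented): split the results by whether the user signed up before the test began, and compare the variants within each cohort.

    # Hypothetical sketch of the cohort split in point 5. Field names and the
    # test start date are invented. The idea: compare each subject line's open
    # rate separately for users who existed before the test and users who
    # signed up after it started.
    from datetime import date

    TEST_START = date(2012, 6, 1)  # invented start date for the subject-line test

    def cohort_rates(events):
        """events: iterable of dicts with 'signup_date', 'variant', 'opened' keys."""
        totals = {}  # (cohort, variant) -> [opens, sends]
        for e in events:
            cohort = "existing" if e["signup_date"] < TEST_START else "new"
            opens, sends = totals.get((cohort, e["variant"]), (0, 0))
            totals[(cohort, e["variant"])] = [opens + bool(e["opened"]), sends + 1]
        return {key: opens / sends for key, (opens, sends) in totals.items()}

    # If the new subject line wins among "existing" users but not among "new"
    # ones, that's the novelty boost talking, not a real improvement.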
What do you think of Myna, in these respects? Does it suffer from the same disadvantages as other bandit optimization approaches?
http://mynaweb.com/docs/
Does it suffer from the same disadvantages as other bandit optimization approaches?
Yes.
That said, the people there are very smart and are doing something good. But I would be very cautious about time-dependent automatic optimization on a website that is undergoing rapid improvement at the same time.
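To illustrate the kind of failure I worry about, here is a toy simulation (all numbers are invented, and this is only a sketch of a 10%-exploration bandit, not anyone's production code): two variants, and halfway through a site change flips which one is better.

    # A toy simulation (all numbers invented) of the failure mode I worry about:
    # two variants, and halfway through a site change flips which one is better.
    # The 10%-exploration bandit keeps sending most traffic to the now-worse
    # variant for a long time, because the starved variant only gets ~5% of the
    # traffic with which to pull its average up.
    import random

    def simulate(n=20000, eps=0.10, seed=2):
        random.seed(seed)
        wins, pulls = [0, 0], [0, 0]
        to_loser_after_change = 0
        for t in range(n):
            rates = (0.04, 0.02) if t < n // 2 else (0.02, 0.04)  # the flip
            if random.random() < eps or 0 in pulls:
                arm = random.randrange(2)                            # explore
            else:
                arm = max((0, 1), key=lambda a: wins[a] / pulls[a])  # exploit
            if t >= n // 2 and arm == 0:
                to_loser_after_change += 1
            pulls[arm] += 1
            wins[arm] += random.random() < rates[arm]
        return to_loser_after_change / (n - n // 2)

    # With these numbers, most of the post-change traffic still goes to the
    # variant that is no longer the winner. A fixed-horizon A/B test run after
    # the change would not have this problem.
    print(simulate())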
#1 certainly is, particularly for businesses prone to seasonal fluctuations. In local lead-gen (for instance) you see big changes in conversion based on the time of year.
Wow, you convinced me...
Sarcasm aside, I've also experienced all of these issues with real world testing and would be interested in hearing your argument as to why you think this is not the case.
Sorry, was on an iPad.
Most or all of the points suffer from:

* is that actually true?
* does regular a/b testing not also face that issue?
* was it suggested that you must "set it and forget it"?
* are there no mechanisms for mitigating these issues?
* would using 20% or 30% mitigate the issues?
* are you not allowed to follow the data closely with the bandit approach?
The whole list struck me as a supposed expert in the status quo pooh-poohing an easier approach.
Most or all of the points suffer from:
Let's address them one by one.
is that actually true?
In every case, yes.
does regular a/b testing not also face that issue?
For the big ones, regular A/B testing does not face that issue. For the more complicated ones, A/B testing does face that issue and I know how to work around it. With a bandit approach I'm not sure I'd have noticed the issue.
was it suggested that you must "set it and forget it"?
Not "must", but it was highly recommended. See paragraph 4 of http://stevehanov.ca/blog/index.php?id=132 - look for the words in bold.
are there no mechanisms for mitigating these issues?
There are mechanisms for mitigating some of these issues. The blog does not address those. As soon as you go into them, you get more complicated. It stops being the "20 lines that always beats A/B testing" that the blog promised.
I was doing some back-of-the-envelope calculations on different methods of mitigating these problems. What I found was that in the best case you turn into a more complicated approximation of A/B testing.
would using 20% or 30% mitigate the issues?
That would lessen the issue that I gave, at the cost of permanently worse performance.
The permanent performance bit can benefit from an example. Suppose that there is a real 5% improvement. The blog's suggested approach keeps spending 10% of traffic on exploration, split evenly between the two versions, so it permanently assigns 5% of traffic to the worse version, for roughly 0.25% (5% of 5%) less improvement than you found.
Now suppose you tried a dozen things. 1/3 of them were 5% better, 1/3 were 5% worse, and 1/3 did not matter. The 10% bandit approach causes you to lose 0.25% conversion for each test with a real difference (eight of the twelve), for a permanent drop of roughly 2% in your conversion rate compared to actually making your decisions.
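Spelled out, here is the arithmetic behind the 0.25% and 2% figures, assuming 10% exploration split evenly between two variants and simply adding the per-test losses:

    # The arithmetic behind the 0.25% and 2% figures above, assuming 10%
    # exploration split evenly between two variants and simply adding the
    # per-test losses.
    explore = 0.10
    traffic_on_loser = explore / 2       # 5% of traffic stays on the worse variant
    relative_gap = 0.05                  # that variant converts 5% worse (relative)
    loss_per_test = traffic_on_loser * relative_gap
    print(loss_per_test)                 # 0.0025 -> 0.25% of conversions lost per test

    tests_with_a_real_difference = 12 * 2 // 3            # 1/3 better + 1/3 worse = 8 of 12
    print(tests_with_a_real_difference * loss_per_test)   # 0.02 -> roughly a 2% drop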
(Note, this is not a problem with all bandit strategies. There are known optimal approaches where the total testing penalty decreases over time. If the assumptions of a k-armed bandit hold, the average returns of the epsilon strategy will lose to "A/B test, then go with the winner", which in turn loses to more sophisticated bandit approaches. The question of interest is whether the assumptions of the bandit strategy really hold.)
Whichever form of testing you use, you're doing better than not testing. Most of the benefit comes simply from testing at all. But the A/B testing approach here is not better by mere hundredths of a percent; it is better by a permanent margin of about 2%. That's not insignificant to a business.
If you move from 10% to 20%, that permanent penalty doubles. You're trading off certain types of short-term errors for long-term errors.
(Again, this is just an artifact of the fact that an epsilon strategy is far from an optimal solution to the bandit problem.)
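For concreteness, one example of a strategy whose testing penalty shrinks over time is Thompson sampling, sketched below. This is my illustration of the general idea, not something the blog (or Myna) necessarily implements, and it still rests on the same stationarity assumptions questioned above.

    # One example of a strategy whose testing penalty shrinks over time:
    # Thompson sampling. Each variant keeps a Beta posterior over its
    # conversion rate; we take one random draw per variant and show the
    # variant with the highest draw. As evidence accumulates, the losing
    # variant's share of traffic falls toward zero instead of staying
    # pinned at epsilon/2 forever.
    import random

    class ThompsonSampler:
        def __init__(self, variants):
            # Beta(1, 1) priors: alpha counts conversions + 1, beta counts misses + 1
            self.alpha = {v: 1 for v in variants}
            self.beta = {v: 1 for v in variants}

        def choose(self):
            draws = {v: random.betavariate(self.alpha[v], self.beta[v]) for v in self.alpha}
            return max(draws, key=draws.get)

        def record(self, variant, converted):
            if converted:
                self.alpha[variant] += 1
            else:
                self.beta[variant] += 1

Of course this still assumes the conversion rates are stable over time, which is exactly the assumption questioned above.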
are you not allowed to follow the data closely with the bandit approach?
I am not sure what you mean here.