Comment by wisty
13 years ago
A stevehanov.ca link? Wow, HN is getting classy again. Please: more articles with code, equations, and/or visualizations; less upvoting of badly thought-out infographics (i.e. pretty numbers that would lose nothing by being presented in a table); and far fewer self-help pseudo-business articles.
+1 on an article does not mean "I agree". It means "I learnt something".
Comment by timr
13 years ago
Bandit optimization has been discussed previously on HN. It's a good technique, but it's not always better than A/B testing. In particular, bandit approaches take longer to converge (on average), and don't give you reliable ways to know when to stop testing (when all you know is that you're using an approach that's optimal in the limit of large N, your only guarantee is that things get better as N gets large). These techniques also make assumptions that aren't valid for a lot of web experiments: identical "bandit" distributions that are constant over time. Throw a few choices that are optimal at different times of day/week/month/year at a bandit optimizer, and it'll just happily fluctuate between them.
Also, there's a lot of variation in performance depending on the parameters of your test -- some of which are completely unknowable. So if you want to really learn about this method, you need to read more than a blog post where the author has concluded that bandit optimization is the new pink. For example, there's a pretty readable paper that does an empirical analysis of the popular bandit algorithms under different parameterizations.
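To make the fluctuation problem concrete, here's a minimal sketch (the conversion rates, epsilon, and period length are all invented) of an epsilon-greedy bandit facing two arms whose true rates swap every half day:

    import random

    random.seed(0)
    EPSILON = 0.1
    STEPS = 200_000
    PERIOD = 50_000  # steps per regime; stands in for half a day

    counts = [0, 0]     # times each arm was shown
    successes = [0, 0]  # conversions per arm

    def true_rate(arm, t):
        # Arm 0 converts better in the "day" regime, arm 1 in the "night" regime.
        day = (t // PERIOD) % 2 == 0
        rates = (0.06, 0.02) if day else (0.02, 0.06)
        return rates[arm]

    for t in range(STEPS):
        if random.random() < EPSILON or 0 in counts:
            arm = random.randrange(2)  # explore
        else:
            arm = max((0, 1), key=lambda a: successes[a] / counts[a])  # exploit
        counts[arm] += 1
        successes[arm] += random.random() < true_rate(arm, t)

    # Each arm's lifetime average blends both regimes, so the "leader"
    # keeps flipping and neither estimate matches either true rate.
    for a in (0, 1):
        print(f"arm {a}: {counts[a]} pulls, observed {successes[a] / counts[a]:.4f}")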
Comment by btilly
13 years ago
The really scary drawback is what happens if the bandit prefers a suboptimal choice at the same time that you make an independent improvement in your website. Then the bandit is going to add a lot of data for that variation, all of which looks really good for reasons that have nothing to do with what it is supposed to be testing.
This type of error (which can happen very easily on a website going through continuous improvement) can take a very long time to recover from.
A/B tests do not have an issue with this because all versions will have similar mixes of data from before and after the improvement.
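A rough simulation of that trap (epsilon-greedy, every number invented): the bandit starts out wrongly preferring the truly worse arm A when a site-wide improvement lifts both arms at once, and we count how long the truly better arm B takes to win back the lead:

    import random

    random.seed(1)
    EPSILON = 0.1
    TRUE = {"A": 0.030 + 0.020, "B": 0.033 + 0.020}  # both arms lifted by the improvement

    # State inherited from before the improvement: through bad luck, arm A
    # (truly worse) currently looks better than arm B.
    counts = {"A": 30_000, "B": 2_000}
    successes = {"A": 1_020, "B": 62}  # est. A = 0.034 > est. B = 0.031

    def est(arm):
        return successes[arm] / counts[arm]

    t = 0
    while est("B") <= est("A") and t < 1_000_000:
        t += 1
        arm = random.choice("AB") if random.random() < EPSILON else max("AB", key=est)
        counts[arm] += 1
        successes[arm] += random.random() < TRUE[arm]

    # A soaks up ~95% of the post-improvement traffic, so its lifetime
    # average inflates first and entrenches the wrong choice.
    print(f"steps until the truly better arm B overtook A: {t}")

On toy numbers like these, recovery tends to take on the order of a hundred thousand visitors -- a long time relative to the size of the effect being measured.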
Comment by saon
13 years ago
This seems like a trivial thing to fix by presenting the optimal choice with limited noise. Let's say it picks the optimal choice x% of the time (some really high number), and when additional changes are made or automatically detected, this percentage drops. If you pick the next most optimal down the line through all of your options, and make x proportional to the time since the last change, it should be reasonably resistant to this kind of biasing in the first place, and can ramp back up at a reasonable rate.
Better yet, make x depend in some way on both the time since the last change and the relative change in performance of all options from before and after the change.
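A minimal sketch of such a schedule (the shape and all constants are made up):

    import time

    X_MAX = 0.95           # exploit probability once things have settled
    X_MIN = 0.50           # exploit probability right after a change
    RAMP_SECONDS = 86_400  # one day to ramp back up

    def exploit_probability(last_change_ts, now=None):
        # Linearly ramp the exploit probability back up as time since the
        # last site change grows; explore more right after a change.
        now = time.time() if now is None else now
        frac = min(1.0, max(0.0, now - last_change_ts) / RAMP_SECONDS)
        return X_MIN + (X_MAX - X_MIN) * frac

    print(exploit_probability(0, now=3_600))   # an hour after a change: ~0.52
    print(exploit_probability(0, now=86_400))  # a day later: 0.95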
Comment by 3pt14159
13 years ago
I might not be understanding you correctly, but wouldn't the independent improvement also help the random bandit choices? If you are using a forgetting factor, this shouldn't be a real issue.
My problem with the bandit method is that I want to show the same test choice to the same person every time he sees the page, so you can hide that there is a test. If I do this with the bandit algo, it warps the results, because different cohorts have different weightings of the choices, and different cohorts behave very differently for lots of reasons.
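For contrast, sticky assignment in a plain A/B test is usually just a hash; a sketch (the experiment name and variants are placeholders):

    import hashlib

    def assign_variant(user_id, experiment, variants):
        # Hash (experiment, user) to a stable bucket so a returning visitor
        # always sees the same variant for this experiment.
        digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
        bucket = int.from_bytes(digest[:8], "big") % len(variants)
        return variants[bucket]

    print(assign_variant("user-42", "signup-button", ["A", "B"]))  # stable across visits

The rub is that a hash hands out fixed proportions. Once a bandit starts shifting the weights, either users get reassigned mid-test or the cohorts stop being comparable, which is exactly the warping described above.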
Comment by wpietri
13 years ago
Agreed. And I'd add that for me, the point of A/B testing is to learn something. We're not just interested in whether A or B is better; we're interested in getting better at doing what we do. Studying the A/B results is half the fun.
"We're not just interested in whether A or B is better; we're interesting in getting better at doing what we do."
Absolutely. That's my biggest objection to bandit methods, but it's also the fuzziest objection, and the one least likely to appeal to hyper-analytical people. There's a strong temptation (as we can see from this article) is to treat bandit optimization as a black box that just spits out an infallible answer (i.e. as a Lazy Button).
It's the same human tendency that has led to "you should follow me on twitter" to be one of the more common n-grams on the interwebs (even though it probably never worked for more than Dustin Curtis, and likely causes a counter-intuitive backlash now).
Comment by vibrunazo
13 years ago
Why can't you do the same study from the bandit results? From what I understand, this is the same as A/B testing, except it will only show a suboptimal result to 10% of the users instead of 50% (or more). After a few days of testing, can't you take a look at the statistics and analyze them the same way you would for an A/B test? Then just stop testing?
The A/B test just gives you the conversion rate of each option. And so does the bandit. As I understand it, the only difference is that the bandit will be bugging your users with bad results less often.
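For what it's worth, the bandit's counts do support the same analysis; here's a sketch with invented counts (a two-proportion z-test is fine with unequal sample sizes):

    from math import sqrt
    from scipy.stats import norm

    # Bandit-collected counts: arm B was shown far less often.
    conv_a, n_a = 540, 18_000  # arm A: conversions, impressions
    conv_b, n_b = 70, 2_000    # arm B

    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * norm.sf(abs(z))  # two-sided

    print(f"A: {p_a:.4f}, B: {p_b:.4f}, z = {z:.2f}, p = {p_value:.3f}")

The caveat, per the objections above, is that adaptive allocation makes the independence assumptions behind this test only approximate.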
"and don't give you reliable ways to know when to stop testing"
Stopping a test when you reach a "statistically significant" result is the wrong way to do A/B testing. In both multi-armed bandit and A/B testing you need to set ahead of time the number of users you are going to run your test against and stop the test at that point regardless of if your result is significant or not.
Comment by btilly
13 years ago
In theory, yes. In practice, no.
See http://elem.com/~btilly/effective-ab-testing/index.html#asli... for part of a presentation I did where I set up some reasonable fake tests and ran simulations. What I found is that if there is a significant difference, the probability of coming to the wrong conclusion was (as you would expect) higher, but not that much higher before the underlying difference made mistakes incredibly unlikely. Conversely, if there is only a small real difference, the amount of data needed before you have a significant chance of accidentally coming to an erroneous conclusion is very, very large.
So avoid accepting any result where you don't have at least a few hundred successes and set your thresholds reasonably high. You will make fewer mistakes than you probably fear, and the ones that you make will almost always be very minor. (Oops, I had a 3% chance of accepting the 1% worse solution as probably better.)
Of course if you're aiming to publish academic research, your standards need to be higher. But if you're more interested in getting useful results than publishable ones, you can relax your standards. A lot.
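A sketch of that kind of simulation (the traffic numbers, true rates, and threshold are invented, and the significance check is a plain two-proportion z-test rather than whatever the linked talk used). It peeks every PEEK visitors per arm, stops early on a "significant" result, and counts how often the truly worse arm wins:

    import random
    from math import sqrt

    random.seed(7)
    P_A, P_B = 0.030, 0.033  # B is truly better
    N_MAX, PEEK, Z_CRIT = 50_000, 1_000, 2.58  # ~99% threshold
    RUNS = 200

    def z_stat(sa, na, sb, nb):
        p = (sa + sb) / (na + nb)
        se = sqrt(p * (1 - p) * (1 / na + 1 / nb))
        return 0.0 if se == 0 else (sb / nb - sa / na) / se

    wrong = 0
    for _ in range(RUNS):
        sa = sb = 0
        for i in range(1, N_MAX + 1):
            sa += random.random() < P_A
            sb += random.random() < P_B
            if i % PEEK == 0:
                z = z_stat(sa, i, sb, i)
                if abs(z) > Z_CRIT:
                    wrong += z < 0  # stopped early on the worse arm
                    break

    print(f"accepted the worse arm in {wrong} of {RUNS} simulated tests")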
"Stopping a test when you reach a "statistically significant" result is the wrong way to do A/B testing."
Nobody said that it was. But when you do regular split testing, you can use power analysis to estimate the length of time you need to run an experiment to get a significant result at a certain precision:
http://en.wikipedia.org/wiki/Statistical_power
You can't do this (at least, not easily) when you're using bandit models, because none of the assumptions are valid.
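For a regular split test, the textbook calculation is short (a sketch; the baseline rate, lift, alpha, and power are made-up inputs):

    from math import ceil
    from scipy.stats import norm

    def sample_size_per_arm(p1, p2, alpha=0.05, power=0.8):
        # Standard power analysis for a two-proportion test: users per arm
        # needed to detect a lift from p1 to p2 at the given error levels.
        z_alpha = norm.ppf(1 - alpha / 2)
        z_beta = norm.ppf(power)
        var = p1 * (1 - p1) + p2 * (1 - p2)
        return ceil(var * (z_alpha + z_beta) ** 2 / (p1 - p2) ** 2)

    # Detecting a 3.0% -> 3.3% lift needs roughly 53,000 users per arm:
    print(sample_size_per_arm(0.030, 0.033))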
I upvoted you because I agree.
But I usually upvote stories based on how interesting they are and whether I feel they were worth the attention. Comments are a mixed bag of agreement, interest, and just well-thought-out arguments that make me think.
This is completely meta and kind of offtopic, so it should probably be downvoted, but I felt like saying it anyway.
I also +1 a comment where I learn something.
As in meat-space voting, you should always reward good behavior; be part of the fitness function.
Sadly, an upvote on stories really means "more of that".
And I really did, so +1 from me.