Comment by wisty
13 years ago
A stevehanov.ca link? Wow, HN is getting classy again. Please: more articles with code, equations, and/or visualizations; less upvoting of badly thought-out infographics (i.e. pretty numbers that would lose nothing by being presented in a table); and far fewer self-help pseudo-business articles.
+1 on an article does not mean "I agree". It means "I learnt something".
Comment by timr
13 years ago
Bandit optimization has been discussed previously on HN. It's a good technique, but it's not always better than A/B testing. In particular, bandit approaches take longer to converge (on average), and don't give you reliable ways to know when to stop testing (when all you know is that you're using an approach that's optimal in the limit of large N, your only guarantee is that things get better as N gets large). These techniques also make assumptions that aren't valid for a lot of web experiments: identical "bandit" distributions that are constant over time. Throw a few choices that are optimal at different times of day/week/month/year at a bandit optimizer, and it'll just happily fluctuate between them.
Also, there's a lot of variation in performance depending on the parameters of your test -- some of which are completely unknowable. So if you want to really learn about this method, you need to read more than a blog post where the author has concluded that bandit optimization is the new pink. For example, there's a pretty readable paper that does an empirical analysis of the popular bandit algorithms under different parameterizations.
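To make the fluctuation problem concrete, here's a minimal sketch (the conversion rates, epsilon, and period length are all invented) of an epsilon-greedy bandit facing two arms whose true rates swap every half day:

    import random

    random.seed(0)
    EPSILON = 0.1
    STEPS = 200_000
    PERIOD = 50_000  # steps per regime; stands in for half a day

    counts = [0, 0]     # times each arm was shown
    successes = [0, 0]  # conversions per arm

    def true_rate(arm, t):
        # Arm 0 converts better in the "day" regime, arm 1 in the "night" regime.
        day = (t // PERIOD) % 2 == 0
        rates = (0.06, 0.02) if day else (0.02, 0.06)
        return rates[arm]

    for t in range(STEPS):
        if random.random() < EPSILON or 0 in counts:
            arm = random.randrange(2)  # explore
        else:
            arm = max((0, 1), key=lambda a: successes[a] / counts[a])  # exploit
        counts[arm] += 1
        successes[arm] += random.random() < true_rate(arm, t)

    # Each arm's lifetime average blends both regimes, so the "leader"
    # keeps flipping and neither estimate matches either true rate.
    for a in (0, 1):
        print(f"arm {a}: {counts[a]} pulls, observed {successes[a] / counts[a]:.4f}")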
Comment by btilly
13 years ago
The really scary drawback is what happens if the bandit prefers a suboptimal choice at the same time that you make an independent improvement in your website. Then the bandit is going to add a lot of data for that variation, all of which looks really good for reasons that have nothing to do with what it is supposed to be testing.
This type of error (which can happen very easily on a website going through continuous improvement) can take a very long time to recover from.
A/B tests do not have an issue with this because all versions will have similar mixes of data from before and after the improvement.
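A rough simulation of that trap (epsilon-greedy, every number invented): the bandit starts out wrongly preferring the truly worse arm A when a site-wide improvement lifts both arms at once, and we count how long the truly better arm B takes to win back the lead:

    import random

    random.seed(1)
    EPSILON = 0.1
    TRUE = {"A": 0.030 + 0.020, "B": 0.033 + 0.020}  # both arms lifted by the improvement

    # State inherited from before the improvement: through bad luck, arm A
    # (truly worse) currently looks better than arm B.
    counts = {"A": 30_000, "B": 2_000}
    successes = {"A": 1_020, "B": 62}  # est. A = 0.034 > est. B = 0.031

    def est(arm):
        return successes[arm] / counts[arm]

    t = 0
    while est("B") <= est("A") and t < 1_000_000:
        t += 1
        arm = random.choice("AB") if random.random() < EPSILON else max("AB", key=est)
        counts[arm] += 1
        successes[arm] += random.random() < TRUE[arm]

    # A soaks up ~95% of the post-improvement traffic, so its lifetime
    # average inflates first and entrenches the wrong choice.
    print(f"steps until the truly better arm B overtook A: {t}")

On toy numbers like these, recovery tends to take on the order of a hundred thousand visitors -- a long time relative to the size of the effect being measured.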
Comment by saon
13 years ago
This seems like a trivial thing to fix by presenting the optimal choice with limited noise. Let's say it picks the optimal choice x% of the time (some really high number), and when additional changes are made or automatically detected, this percentage drops. If you pick the next most optimal down the line through all of your options, and make x proportional to the time since the last change, it should be reasonably resistant to this kind of biasing in the first place, and can ramp back up at a reasonable rate.
Better yet, make x depend in some way on both the time since the last change and the relative change in performance of all options from before and after the change.
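A minimal sketch of such a schedule (the shape and all constants are made up):

    import time

    X_MAX = 0.95           # exploit probability once things have settled
    X_MIN = 0.50           # exploit probability right after a change
    RAMP_SECONDS = 86_400  # one day to ramp back up

    def exploit_probability(last_change_ts, now=None):
        # Linearly ramp the exploit probability back up as time since the
        # last site change grows; explore more right after a change.
        now = time.time() if now is None else now
        frac = min(1.0, max(0.0, now - last_change_ts) / RAMP_SECONDS)
        return X_MIN + (X_MAX - X_MIN) * frac

    print(exploit_probability(0, now=3_600))   # an hour after a change: ~0.52
    print(exploit_probability(0, now=86_400))  # a day later: 0.95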
Comment by 3pt14159
13 years ago
I might not be understanding you correctly, but wouldn't the independent improvement also help the random bandit choices? If you are using a forgetting factor, this shouldn't be a real issue.
My problem with the bandit method is that I want to show the same test choice to the same person every time he sees the page, so you can hide that there is a test. If I do this with the bandit algo, it warps the results, because different cohorts have different weightings of the choices, and different cohorts behave very differently for lots of reasons.
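For contrast, sticky assignment in a plain A/B test is usually just a hash; a sketch (the experiment name and variants are placeholders):

    import hashlib

    def assign_variant(user_id, experiment, variants):
        # Hash (experiment, user) to a stable bucket so a returning visitor
        # always sees the same variant for this experiment.
        digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
        bucket = int.from_bytes(digest[:8], "big") % len(variants)
        return variants[bucket]

    print(assign_variant("user-42", "signup-button", ["A", "B"]))  # stable across visits

The rub is that a hash hands out fixed proportions. Once a bandit starts shifting the weights, either users get reassigned mid-test or the cohorts stop being comparable, which is exactly the warping described above.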
Comment by wpietri
13 years ago
Agreed. And I'd add that for me, the point of A/B testing is to learn something. We're not just interested in whether A or B is better; we're interested in getting better at doing what we do. Studying the A/B results is half the fun.
"We're not just interested in whether A or B is better; we're interesting in getting better at doing what we do."
Absolutely. That's my biggest objection to bandit methods, but it's also the fuzziest objection, and the one least likely to appeal to hyper-analytical people. There's a strong temptation (as we can see from this article) is to treat bandit optimization as a black box that just spits out an infallible answer (i.e. as a Lazy Button).
It's the same human tendency that has led to "you should follow me on twitter" to be one of the more common n-grams on the interwebs (even though it probably never worked for more than Dustin Curtis, and likely causes a counter-intuitive backlash now).
Comment by vibrunazo
13 years ago
Why can't you do the same study from the bandit results? From what I understand, this is the same as A/B testing, except it will only show a suboptimal result to 10% of the users instead of 50% (or more). After a few days of testing, can't you take a look at the statistics and analyze them the same way you would for an A/B test? Then just stop testing?
The A/B test just gives you the conversion rate of each option. And so does the bandit. As I understand it, the only difference is that the bandit will be bugging your users with bad results less often.
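For what it's worth, the bandit's counts do support the same analysis; here's a sketch with invented counts (a two-proportion z-test is fine with unequal sample sizes):

    from math import sqrt
    from scipy.stats import norm

    # Bandit-collected counts: arm B was shown far less often.
    conv_a, n_a = 540, 18_000  # arm A: conversions, impressions
    conv_b, n_b = 70, 2_000    # arm B

    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * norm.sf(abs(z))  # two-sided

    print(f"A: {p_a:.4f}, B: {p_b:.4f}, z = {z:.2f}, p = {p_value:.3f}")

The caveat, per the objections above, is that adaptive allocation makes the independence assumptions behind this test only approximate.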
"and don't give you reliable ways to know when to stop testing"
Stopping a test when you reach a "statistically significant" result is the wrong way to do A/B testing. In both multi-armed bandit and A/B testing you need to set ahead of time the number of users you are going to run your test against and stop the test at that point regardless of if your result is significant or not.
Comment by btilly
13 years ago
In theory, yes. In practice, no.
See http://elem.com/~btilly/effective-ab-testing/index.html#asli... for part of a presentation I did where I set up some reasonable fake tests and ran simulations. What I found is that if there is a significant difference, the probability of coming to the wrong conclusion was (as you would expect) higher, but not that much higher before the underlying difference made mistakes incredibly unlikely. Conversely, if there is only a small real difference, the amount of data needed before you have a significant chance of accidentally coming to an erroneous conclusion is very, very large.
So avoid accepting any result where you don't have at least a few hundred successes and set your thresholds reasonably high. You will make fewer mistakes than you probably fear, and the ones that you make will almost always be very minor. (Oops, I had a 3% chance of accepting the 1% worse solution as probably better.)
Of course if you're aiming to publish academic research, your standards need to be higher. But if you're more interested in getting useful results than publishable ones, you can relax your standards. A lot.
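A sketch of that kind of simulation (the traffic numbers, true rates, and threshold are invented, and the significance check is a plain two-proportion z-test rather than whatever the linked talk used). It peeks every PEEK visitors per arm, stops early on a "significant" result, and counts how often the truly worse arm wins:

    import random
    from math import sqrt

    random.seed(7)
    P_A, P_B = 0.030, 0.033  # B is truly better
    N_MAX, PEEK, Z_CRIT = 50_000, 1_000, 2.58  # ~99% threshold
    RUNS = 200

    def z_stat(sa, na, sb, nb):
        p = (sa + sb) / (na + nb)
        se = sqrt(p * (1 - p) * (1 / na + 1 / nb))
        return 0.0 if se == 0 else (sb / nb - sa / na) / se

    wrong = 0
    for _ in range(RUNS):
        sa = sb = 0
        for i in range(1, N_MAX + 1):
            sa += random.random() < P_A
            sb += random.random() < P_B
            if i % PEEK == 0:
                z = z_stat(sa, i, sb, i)
                if abs(z) > Z_CRIT:
                    wrong += z < 0  # stopped early on the worse arm
                    break

    print(f"accepted the worse arm in {wrong} of {RUNS} simulated tests")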
"Stopping a test when you reach a "statistically significant" result is the wrong way to do A/B testing."
Nobody said that it was. But when you do regular split testing, you can use power analysis to estimate the length of time you need to run an experiment to get a significant result at a certain precision:
http://en.wikipedia.org/wiki/Statistical_power
You can't do this (at least, not easily) when you're using bandit models, because none of the assumptions are valid.
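For a regular split test, the textbook calculation is short (a sketch; the baseline rate, lift, alpha, and power are made-up inputs):

    from math import ceil
    from scipy.stats import norm

    def sample_size_per_arm(p1, p2, alpha=0.05, power=0.8):
        # Standard power analysis for a two-proportion test: users per arm
        # needed to detect a lift from p1 to p2 at the given error levels.
        z_alpha = norm.ppf(1 - alpha / 2)
        z_beta = norm.ppf(power)
        var = p1 * (1 - p1) + p2 * (1 - p2)
        return ceil(var * (z_alpha + z_beta) ** 2 / (p1 - p2) ** 2)

    # Detecting a 3.0% -> 3.3% lift needs roughly 53,000 users per arm:
    print(sample_size_per_arm(0.030, 0.033))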
I upvoted you because I agree.
But I usually upvote stories based on how interesting they are and whether I feel they were worth the attention. Comments are a mixed bag of agreement, interest, and just well-thought-out arguments that make me think.
This is completely meta and kind of offtopic, so it should probably be downvoted, but I felt like saying it anyway.
I also +1 a comment where I learn something.
As in meat-space voting, you should always reward good behavior; be part of the fitness function.
Sadly, an upvote on stories really means "more of that".
And I really did, so +1 from me.