Comment by sweezyjeezy
5 days ago
One of the assumptions of vanilla multi-armed bandits is that the underlying reward rates are fixed. It's not valid to assume that in a lot of cases, including e-commerce. The author is dismissive and hard wavy about this and having worked in in e-commerce SaaS I'd be a bit more cautious.
Imagine that you are running MAB on an website with a control/treatment variant. After a bit you end up sampling the treatment a little more, say 60/40. You now start running a sale - and the conversion rate for both sides goes up equally. But since you are now sampling more from the treatment variant, its aggregate conversion rate goes up faster than the control - you start weighting even more towards that variant.
Fluctuating reward rates are everywhere in e-commerce, and tend to destabilise MAB proportions, even on two identical variants, they can even cause it to lean towards the wrong one. There are more sophisticated MAB approaches that try to remove the identical reward-rate assumption - they have to model a lot more uncertainty, and so optimise more conservatively.
> ...the conversion rate for both sides goes up equally.
If the conversion rate "goes up equally", why did you not measure this and use that as a basis for your decisions?
> its aggregate conversion rate goes up faster than the control - you start weighting even more towards that variant.
This sounds simply like using bad math. Wouldn't this kill most experiments that start with 10% for the variant that do not provide 10x the improvement?
No. This isn't just bad math.
The problem here is that the weighting of the alternatives changes over time and the thing you are measuring may also change. If you start by measuring the better option, but then bring in the worse option in a better general climate, you could easily conclude the worse option is better.
To give a concrete example, suppose you have two versions of your website, one in English and one in Japanese. Worldwide, Japanese speakers tend to be awake at different hours than English speakers. If you don't run your tests over full days, you may bias the results to one audience or the other. Even worse, weekend visitors may be much different than weekday visitors so you may need to slow down to full weeks for your tests.
Changing tests slowly may mean that you can only run a few tests unless you are looking at large effects which will show through the confounding effects.
And that leads back to the most prominent normal use which is progressive deployments. The goal there is to test whether the new version is catastrophically worse than the old one so that as soon as you have error bars that bound the new performance away from catastrophe, you are good to go.
I mean, sure you could test over only part of the day, but if you do, that is, imho, bad math.
Eg. I could sum up 10 (decimal) and 010 (octal) as 20, but because they were the same digits in different numbering systems, you need to normalize the values first to the same base.
Or I could add up 5 GBP, 5 USD, 5 EUR and 5 JPY and claim I got 20 of "currency", but it doesn't really mean anything.
Otherwise, we are comparing incomparable values, and that's bad math.
Sure, percentages is what everybody gets wrong (hey percentage points vs percentage), but that does not make them not wrong. And knowing what is comparable when you simply talk in percentages, even more so (as per your examples).
It is a universal truth that people fuck up statistical math.
If you aren’t testing at exactly 50/50 - and you can’t because my plan for visiting a site and for how long will never be equivalent to your plan, then any other factors that can affect conversion rate will cause one partition to go up faster than the other. You have to test at a level of Amazon to get statistical significance anyway.
And as many if us have told people until they’re blue in the face: we (you) are not a FAANG company and pretending to be one won’t work.
One of the other comment threads has a link to a James LeDoux post about MAB with EG, UCB1, BUCB and EXP3, with EXP3, from what I've seen, marketed as an "adversarial" MAB method [0] [1].
I found a post [2] of doing some very rudimentary testing on EXP3 against UCB to see if it performs better in what could be considered an adversarial environment. From what I can tell, it didn't perform all that well.
Do you, or anyone else, have an actual use case for when EXP3 performs better than any of the standard alternatives (UCB, TS, EG)? Do you have experience with running MAB in adversarial environments? Have you found EXP3 performs well?
[0] https://www.jeremykun.com/2013/11/08/adversarial-bandits-and...
Motivations can vary on a diurnal basis too. Or based on location. It means something different if I’m using homedepot.com at home or standing in an aisle at the store.
And with physical retailers with online catalogs, an online sale of one item may cannibalize an in-store purchase of not only that item but three other incidental purchases.
But at the end of the day your 60/40 example is just another way of saying: you don’t try to compare two fractions with a different denominator. It’s a rookie mistake.
Good point about fluctuating rates for e.g the sales period. But couldn't you then pick a metric that doesn't fluctuate?
Out of curiosity, where did you work? In the same space as you.
I don't follow. In this case would sampling 50/50 always give better/unbiased results on the experiment?
Sampling 50/50 will always give you the best chance of picking the best ultimate 'winner' in a fixed time horizon, at the cost of only sampling the winning variant 50% of the time. That's true if the reward rates are fixed or not. But some changes in reward rates will also cause MAB aggregate statistics to skew in a way that they shouldn't for a 50/50 split yeah.
What do you think of using the epsilon-first approach then? We could explore for that fixed time horizon, then start choosing greedy after that. I feel like the only downside is that adding new arms becomes more complicated.
1 reply →
Yes.
I agree that there's an exploration-exploitation tradeoff, but for what you specifically suggest wouldn't you presumably just normalize by sample size? You wouldn't allocate based off total conversions, but rather a percentage.
Imagine a scenario where option B does 10x better than option A during the morning hours but -2x worse the rest of the day. If you start the multi armed bandit in the morning it could converge to option B quickly and dominate the rest of the day even though it performs worse then.
Or in the above scenario option B performs a lot better than option A but only with the sale going, otherwise option B performs worse.
One of the problems we caught only once or twice: mobile versus desktop shifting with time of day, and what works on mobile may work worse than on desktop.
We weren’t at the level of hacking our users, just looking at changes that affect response time and resource utilizations, and figuring out why a change actually seems to have made things worse instead of better. It’s easy for people to misread graphs. Especially if the graphs are using Lying with Statistics anti patterns.
Yes but here's a exaggerated version - say were to sample for a week at 50/50 when the base conversion rate was at 4%, then we sample at 25/75 for a week with the base conversion rate bumped up to 8% due to a sale.
The average base rate for the first variant is 5.3%, the second is 6.4%. Generally the favoured variant's average will shift faster because we are sampling it more.
Uhm, this still sounds like just bad math.
While it's non-obvious this is the effect, anyone analyzing the results should be aware of it and should only compare weighted averages, or per distinct time periods.
And therein is the largest problem with A/B testing: it's mostly done by people not understanding the math subtleties, thus they will misinterpret results in either direction.
1 reply →