Comment by ted_dunning

2 years ago

Multi-armed bandit approaches do not imply an immediate feedback loop. They do the best you can do with delayed feedback or with episodic adjustment as well.

So if you are doing A/B tests, it is quite reasonable to use Thompson sampling at fixed intervals to adjust the proportions. If your response variable is not time invariant, this is actually best practice.

1 comment

ted_dunning

orasis 1 year ago

Having significant experience with bandits in production, I strongly recommend only using them for immediate feedback. If the rewards are at all disconnected from the action you likely won’t be happy with the results.