Comment by LPisGood

6 days ago

I haven’t seen these particular shortcomings before, but I certainly agree that if your data is bad, this ML approach will also be bad.

Can you share some more details about your experiences with those particular types of failures?

Sure! A really simple (and common) example would be a setup with treatment A and treatment B, where your code does "if session_assignment == A ... else ... B". In the else branch you do something that for whatever reason causes misbehavior (perhaps it sometimes crashes, throws an exception, or uses a buffer that drops records under high load to protect availability). That's surprisingly common. Or perhaps you were hashing on the wrong key to generate session assignments, e.g. you accidentally used an ID that expires after 24 hours of inactivity, and now only highly active people get correctly sampled.
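To make that concrete, here's a minimal sketch of that failure mode (the names and assignment scheme are purely illustrative, not anyone's actual setup):

    import hashlib

    def assign(session_id: str) -> str:
        # Hash-based assignment, stable per session_id. If session_id is an
        # ephemeral ID that expires after 24h of inactivity, the same user can
        # get re-randomized, and only highly active users are sampled correctly.
        bucket = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % 2
        return "A" if bucket == 0 else "B"

    def handle_request(session_id: str, log: list) -> None:
        if assign(session_id) == "A":
            log.append(("A", session_id))
        else:
            # Imagine this branch occasionally throws under load, or writes
            # through a lossy buffer: those sessions silently vanish from the
            # logs, so treatment B's data is biased before any analysis runs.
            log.append(("B", session_id))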

Another common one I saw was due to different systems handling different treatments, with caching discrepancies between the two. Especially in a MAB (multi-armed bandit), where allocations are constantly changing, if one system has a much longer cache TTL than the other you might see allocation lags for one treatment and not the other, biasing the data. Or perhaps one system deploys much more frequently, and the load balancer's draining doesn't wait for records to finish uploading before it kills the process.
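A rough sketch of the TTL-lag problem, with made-up numbers (fetch_from_bandit is a hypothetical call to whatever serves the current allocation):

    import time

    class CachedAllocation:
        """Caches the bandit's arm probabilities for `ttl` seconds."""

        def __init__(self, fetch_allocation, ttl: float):
            self.fetch = fetch_allocation  # e.g. a call to the bandit service
            self.ttl = ttl
            self._cached = None
            self._fetched_at = float("-inf")

        def get(self) -> dict:
            now = time.monotonic()
            if now - self._fetched_at > self.ttl:
                self._cached = self.fetch()
                self._fetched_at = now
            return self._cached

    # System X refreshes every minute, system Y once a day. If the bandit
    # shifts from a 90/10 to a 10/90 split mid-day, Y keeps routing ~90% of
    # its traffic to the stale arm, so the logged data mixes two policies.
    # system_x = CachedAllocation(fetch_from_bandit, ttl=60)
    # system_y = CachedAllocation(fetch_from_bandit, ttl=86_400)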

The most subtle ones were eligibility biases, where one treatment might cause users to drop out of an experiment entirely. For example, if you have a signup form and want to measure long-term retention, one treatment might cause some cohorts to never complete the signup, so they vanish from the data altogether.
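A toy simulation of that bias (all numbers made up): per-user retention is identical across arms, but treatment B loses half of its low-intent users at signup, so B's surviving cohort skews high-intent and its measured retention looks better.

    import random

    random.seed(0)

    def simulate(n: int = 100_000) -> None:
        retained = {"A": [], "B": []}
        for _ in range(n):
            arm = random.choice(["A", "B"])
            high_intent = random.random() < 0.3      # 30% of users
            if arm == "B" and not high_intent and random.random() < 0.5:
                continue  # abandons signup -> never enters the experiment data
            p_retain = 0.6 if high_intent else 0.2   # driven by intent, not arm
            retained[arm].append(random.random() < p_retain)
        for arm, xs in retained.items():
            print(arm, f"retention={sum(xs) / len(xs):.3f}", f"n={len(xs)}")

    simulate()  # B comes out around 0.38 vs ~0.32 for A, purely from dropout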

There are definitely mitigations for these issues; for example, you can monitor the expected vs. actual allocations and alert if they go out of whack. That has its own set of problems and statistics, though.
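For what it's worth, that kind of monitor is essentially what's often called a sample-ratio-mismatch (SRM) check. A minimal sketch using a chi-squared test (the threshold and names are just illustrative):

    from scipy.stats import chisquare

    def srm_check(observed: dict, expected_ratio: dict, alpha: float = 0.001) -> bool:
        """True if observed assignment counts look suspicious (possible SRM)."""
        total = sum(observed.values())
        arms = sorted(observed)
        obs = [observed[a] for a in arms]
        exp = [expected_ratio[a] * total for a in arms]
        _, p_value = chisquare(obs, f_exp=exp)
        return p_value < alpha

    # A 50/50 experiment that logged 50,000 A's but only 48,500 B's:
    print(srm_check({"A": 50_000, "B": 48_500}, {"A": 0.5, "B": 0.5}))  # True -> alert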