Comment by magicalhippo

2 years ago

Maybe I misunderstood something, I admit I glossed over parts of the article, but the explanations seemed a bit over-complicated.

In the balls and bucket scenario, ideally you'd want to scan all the buckets and put the ball in the least filled one.

By picking two random buckets and putting the ball in the least filled of the two, you approximate the ideal algorithm.

The chance of picking two relatively filled buckets is inversely proportional to how "bad" it is. If there are just two buckets which have more balls than the others then it's relatively bad to pick those two, but chances are low. And vice versa.

Still, hadn't thought about this before, interesting trick indeed!

Two random choices has a few significant advantages over "find the least filled one":

- It's O(1), rather than O(buckets).

- It's robust to stale data and concurrency. One way systems try to avoid the O(buckets) work is by caching, or by scanning in parallel, either of which cause stale load estimates. "Pick the best" tends to lead to significant over-filling http://www.eecs.harvard.edu/~michaelm/postscripts/handbook20... or https://brooker.co.za/blog/2012/01/17/two-random.html).

- Best-of-k (rather than best-of-two) provides a parameter `k` that allows the system to tradeoff between optimality (higher k is more optimal), and robustness to stale data and concurrent accesses (lower k is more robust). Mitzenmacher's core observation is that the step up from k=1 to k=2 is a massive improvement in optimality with only a small loss in robustness (and much more than most people would expect).

They talked about doing this for AWS lambda at "gigantic scale". I assume at some point scanning all the buckets becomes infeasible.

I think it comes down to complexity. Scanning is O(N) where N is the number of buckets, two random choices is O(1) (if bucket lookup is O(1), e.g. a hash map).

It felt over-complicated to me, too. Your explanation is significantly more concise and is a lot clearer to me

Do you have to pick two random buckets or just one random bucket and one adjacent bucket?

Feels like picking an adjacent bucket would work just as well and increase locality.