Maybe I misunderstood something, I admit I glossed over parts of the article, but the explanations seemed a bit over-complicated.
In the balls-and-buckets scenario, ideally you'd want to scan all the buckets and put the ball in the least filled one.
By picking two random buckets and putting the ball in the least filled of the two, you approximate the ideal algorithm.
The chance of picking two relatively full buckets is inversely related to how "bad" that outcome is. If there are just two buckets with more balls than the rest, then picking exactly those two is relatively bad, but the chances of doing so are low. And vice versa.
Still, hadn't thought about this before, interesting trick indeed!
Two random choices has a few significant advantages over "find the least filled one":
- It's O(1), rather than O(buckets).
- It's robust to stale data and concurrency. One way systems try to avoid the O(buckets) work is by caching, or by scanning in parallel, either of which causes stale load estimates. "Pick the best" tends to lead to significant over-filling (see http://www.eecs.harvard.edu/~michaelm/postscripts/handbook20... or https://brooker.co.za/blog/2012/01/17/two-random.html).
- Best-of-k (rather than best-of-two) provides a parameter `k` that allows the system to trade off between optimality (higher k is more optimal) and robustness to stale data and concurrent accesses (lower k is more robust). Mitzenmacher's core observation is that the step up from k=1 to k=2 is a massive improvement in optimality (much larger than most people would expect) with only a small loss in robustness. A rough sketch of best-of-k follows below.
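A minimal sketch of the selection step in Python, just to make the mechanics concrete (the server names and the load-lookup function are made up for illustration; in a real system get_load would be a possibly stale estimate):

    import random

    def pick_best_of_k(servers, get_load, k=2):
        # Sample k servers uniformly at random and send the work to the
        # least loaded of the sample. k=1 is plain random placement,
        # k=2 is the classic "power of two random choices".
        candidates = random.sample(servers, k)
        return min(candidates, key=get_load)

    # Example with a (possibly stale) snapshot of load estimates.
    loads = {"s1": 7, "s2": 3, "s3": 9, "s4": 3}
    target = pick_best_of_k(list(loads), loads.get, k=2)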
They talked about doing this for AWS Lambda at "gigantic scale". I assume at some point scanning all the buckets becomes infeasible.
I think it comes down to complexity. Scanning is O(N) where N is the number of buckets, two random choices is O(1) (if bucket lookup is O(1), e.g. a hash map).
Yes exactly. You approximate the full algorithm while not paying the overhead.
It felt over-complicated to me, too. Your explanation is significantly more concise and a lot clearer to me.
Do you have to pick two random buckets or just one random bucket and one adjacent bucket?
Feels like picking an adjacent bucket would work just as well and increase locality.
In a followup post he explains that almost the opposite is true: split your buckets into 2 groups, pick a random A and B, one from each group, and if it's a draw, always load A first. This means buckets in the B group will likely be 1 less loaded than those in the A group.
https://medium.com/@mihsathe/load-balancing-a-very-counterin...
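A rough sketch of that variant as I understand it (Python; the names are just for illustration):

    import random

    def pick_two_groups(group_a, group_b, get_load):
        # One candidate from each group; ties always go to group A, which
        # is what keeps group B's buckets roughly one unit less loaded.
        a = random.choice(group_a)
        b = random.choice(group_b)
        return a if get_load(a) <= get_load(b) else b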
I was thinking about that, but I haven't had time to really dig into it. My gut feeling is that it won't be sufficient, but it might be a bad hunch.
This is excellent! A prime example of what real engineering work can look like in the field of “software engineering”. I wish we saw more posts sharing this kind of knowledge.
It reminds me of the method Redis used for evicting keys: it would randomly choose N keys and evict the oldest of the set. It approximated evicting the oldest key in the database (more or less, depending on your value of N) but was cheaper than keeping track of the oldest keys. And it needed to be cheap, because when the database filled up it happened a whole lot.
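Something like this, if I remember the idea right (a simplified sketch, not Redis's actual implementation):

    import random

    def evict_approx_oldest(last_access, sample_size=5):
        # Sample a handful of keys and evict the one accessed longest ago.
        # Approximates evicting the oldest key without keeping every key
        # ordered by age. last_access maps key -> last access timestamp.
        sample = random.sample(list(last_access), min(sample_size, len(last_access)))
        victim = min(sample, key=last_access.get)
        del last_access[victim]
        return victim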
I think this was a nice article and it helped walk through some of the maybe-unintuitive maths, but I feel like they missed the meat a bit.
With random placement, as was said in the article, each flip is independent. With PoTRC it’s using knowledge of current queue depth.
The author didn't explain the challenges around tracking queue depth, or why PoTRC is better than, say, keeping an ordered list of queue depths and picking the smallest one.
I'm assuming the trade-off is the cost of managing the sorted list of queue depths vs just O(1) lookups of the random nodes' depths (sketch of the former below).
It just would’ve been nice to cover that in the article.
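For contrast, the "keep an ordered list" alternative looks roughly like this (a sketch using a heap with lazy invalidation; hypothetical names, and the re-push on every depth change is exactly the bookkeeping cost being traded away):

    import heapq

    heap = []      # entries are (depth, queue_id)
    depths = {}    # queue_id -> current depth

    def record_depth(queue_id, depth):
        # Every depth change costs an O(log n) push.
        depths[queue_id] = depth
        heapq.heappush(heap, (depth, queue_id))

    def pick_shallowest():
        # Discard stale entries lazily until the top of the heap is current.
        while heap:
            depth, queue_id = heap[0]
            if depths.get(queue_id) == depth:
                return queue_id
            heapq.heappop(heap)
        return None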
The beauty is it doesn't need to be particularly accurate for the technique to work.
What's wrong with consistent hashing (hash requests and servers into a single coordinate space, e.g. an array, randomly distributed, then scan forward from the request until you find a server)?
https://en.wikipedia.org/wiki/Consistent_hashing
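For reference, a minimal hash-ring sketch (virtual nodes omitted for brevity; adding them is exactly where the memory overhead mentioned below comes in):

    import bisect
    import hashlib

    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    class HashRing:
        # Servers and requests share one hash space; a request goes to the
        # first server at or after its hash position, wrapping around.
        def __init__(self, servers):
            self.points = sorted((_hash(s), s) for s in servers)

        def route(self, request_key):
            h = _hash(request_key)
            i = bisect.bisect_left(self.points, (h, ""))
            if i == len(self.points):
                i = 0  # wrap around the ring
            return self.points[i][1]

    ring = HashRing(["server-a", "server-b", "server-c"])
    ring.route("user-42")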
Consistent hashing requires a bunch of additional memory to solve load variance. There's a nice equation to calculate this somewhere that I'd have to search for, but roughly you need 100 virtual slots in your coordinate space per destination server to reduce the standard deviation to about 10%, and 1000 virtual slots to reduce it to less than 5%.
There are some alternative hashing solutions (rendezvous hashing, jump hash, Maglev) that improve load distribution without the memory overhead, but they still have similar challenges with node addition/removal overhead. Also, integrating state into any of these solutions is incredibly difficult and really antithetical to their point (being deterministic). This is fine when you're assigning uniform workloads, and preferred when those assignments need to be long-lived, but for request load balancing you're dealing with uneven workloads and short-lived assignments.
This is where power-of-two load balancing shines, as it allows you to perform stateful load balancing (using queued requests, CPU utilization, host performance) efficiently, with low memory overhead and minimal effort to add/remove nodes. And if you lose a node with state, it's no big deal because the lifetime of a request is so short.
Easier to think of it as a form of cuckoo hashing.
https://en.m.wikipedia.org/wiki/Cuckoo_hashing
I liked the approach. I did not know it. But what are the advantages against taking the least occupied server/bin always? I have seen good improvements with nginx load balancing when changing its configuration to least_conn, that chooses the server with least connections.
> what are the advantages against taking the least occupied server/bin always
The other answers are giving algorithmic reasons. Sure, but pick-two is even better for uptime reasons.
In serving at scale, quite often the "least occupied" server of all of them is one with an error causing it to have least load/connections/whatever.
By starting with a random two, you pick among probably-healthy servers, and then take the least loaded of those.
> good improvements with nginx load balancing when changing its configuration to least_conn, that chooses the server with least connections.
nginx doesn't know why your app is failing to keep connections open long enough to serve them; least_conn between two random choices results in far fewer 100% "outages" when a single box or subset of boxes among 10s or 1000s is bad.
Wouldn't that be better solved with health checking and removing the poor performing server from the pool?
Not my field, but I'm guessing when you have thousands of servers it can be quite the overhead to keep track of the load of each one for each request.
The proposed algorithm only needs to consider two, and from what I can gather should work in parallel with multiple load balancers without coordination.
It’s more costly to determine the server with absolute least connections since you have to keep track of the ordering of the servers by how many connections they have.
See my reply here: https://news.ycombinator.com/item?id=37175685
Short version: it's robust to staleness and concurrency, which is important in distributed settings ("find the best" is hard to do when "best" is a moving target).
In large enough systems, it becomes prohibitively expensive to keep consensus of the state of all members.
Sure, with 10 or so hosts it's trivial to select the least busy one, but at the scale of 1000s of hosts you begin to run into a ton of overhead just to decide where to forward a request.
As many others have said, O(N) for each request vs O(1) while avoiding hot spots.
O(log n) vs O(1)
Is there a way to keep a list with changing values sorted in real time in order to use a log n search algorithm?
I do not understand why they chose random assignment as a baseline instead of round robin, which is really the simplest common load balancing algorithm.
Round robin is hard when you're doing things concurrently and distributed, because state is hard.
A server is taking requests 30 at a time in separate threads. To add a sequential step for picking which target is next, you need to add a synchronous piece of code that all of the threads depend on: a bottleneck. That's doable, but it adds complexity.
And then you have (let's say) 30 servers making calls. Do they each use round robin load balancing separately, or do they try to coordinate? That's difficult.
"Pick two randomly" requires no state. Every thread and server can run the same algorithm without regard for each other.
> "Pick two randomly" requires no state. Every thread and server can run the same algorithm without regard for each other.
That’s not quite true. Picking between two “buckets” still requires knowing how many “balls” are in each which is state. That state can be local to each server or global, that state can be accessed concurrently or synchronously, but you still have the same problem to solve.
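To make the "local state" point concrete, here is a sketch of a per-caller picker whose only state is its own in-flight counts (hypothetical names, not any particular library):

    import random
    from collections import defaultdict

    class LocalTwoChoice:
        # Each caller tracks only its own in-flight requests per backend,
        # so no coordination with other callers or threads is needed.
        def __init__(self, backends):
            self.backends = backends
            self.in_flight = defaultdict(int)

        def acquire(self):
            a, b = random.sample(self.backends, 2)
            chosen = a if self.in_flight[a] <= self.in_flight[b] else b
            self.in_flight[chosen] += 1
            return chosen

        def release(self, backend):
            self.in_flight[backend] -= 1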
I suspect it's because random assignment is simpler to implement than incremental round robin when the targets are moving rapidly (e.g. pods coming and going in K8s). If you are iterating over N replicas in true round-robin fashion and the (N+1)th replica comes online, you need to manage inserting it into your rotation to maintain true incremental round robin. To do it probabilistically you just update your weights. This is how kube-proxy sets up iptables-based load balancing (rough sketch of the probabilistic version below).
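Roughly, the probabilistic version is just this (a sketch in Python; kube-proxy actually does it with iptables probability rules, not application code):

    import random

    replicas = ["pod-a", "pod-b", "pod-c"]
    weights = [1, 1, 1]   # equal weights -> uniform random assignment

    def pick():
        return random.choices(replicas, weights=weights, k=1)[0]

    # A new replica coming online is just another entry and weight;
    # there is no rotation pointer to splice it into.
    replicas.append("pod-d")
    weights.append(1)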
Round robin can work very well when the variance of task sizes is small, but degrades quickly when the variance of task sizes becomes large (i.e. there are some tasks that keep servers busy for minutes, and some for milliseconds).
In distributed systems with large numbers of producers (like one of the systems Mihir has worked on, AWS Lambda), round robin essentially degrades into random placement as the number of producers increases.
Yea, order-of-magnitude differences in execution times get messy behind load balancers.
We had an API that was a kitchen sink of different tools, including reporting. Eventually we split reporting off into its own service/LB group, because calls returning 100 bytes of text could get delayed behind a few megabytes of data being assembled.
If you have multiple load balancers then doing round robin is hard because they need to talk to each other to ensure proper balancing.
The proposed algorithm should work sufficiently well for multiple load balancers without coordination, assuming the number of load balancers is small compared to the number of servers processing the load.