Comment by trabant00
2 years ago
I do not understand why they chose random assignment as a baseline instead of round robin, which is really the simplest common load-balancing algorithm.
Round robin is hard when you're doing things concurrently and distributed, because state is hard.
A server is taking requests 30 at a time in separate threads. To add a sequential step for picking which target is next, you need to add a synchronized piece of code that all of the threads depend on: a bottleneck. That's doable, but it adds complexity.
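A minimal sketch (my own, not from the comment) of what that synchronized step looks like: every request-handling thread has to serialize on one shared counter, which is exactly the bottleneck being described.

```python
import threading

class RoundRobinPicker:
    """Thread-safe round robin: correct, but every pick takes the same lock."""

    def __init__(self, targets):
        self.targets = targets
        self.counter = 0
        self.lock = threading.Lock()  # all concurrent threads serialize here

    def pick(self):
        with self.lock:
            target = self.targets[self.counter % len(self.targets)]
            self.counter += 1
            return target

picker = RoundRobinPicker(["a", "b", "c"])
print([picker.pick() for _ in range(4)])  # -> ['a', 'b', 'c', 'a']
```

Under light load the lock is cheap, but it is still a single point every thread must pass through, and it says nothing about coordinating across 30 separate servers.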
And then you have (let's say) 30 servers making calls. Do they each use round robin load balancing separately, or do they try to coordinate? That's difficult.
"Pick two randomly" requires no state. Every thread and server can run the same algorithm without regard for each other.
> "Pick two randomly" requires no state. Every thread and server can run the same algorithm without regard for each other.
That’s not quite true. Picking between two “buckets” still requires knowing how many “balls” are in each, which is state. That state can be local to each server or global, and it can be accessed concurrently or synchronously, but you still have the same problem to solve.
In addition, there is a physical machine somewhere that has all the actual data and it's costly to move. The machine you're connecting to can get taken down by a noisy neighbor and rebalancing data is expensive.
I suspect it’s because random assignment is simpler to implement than incremental round robin when the targets are moving rapidly (e.g. pods coming and going in K8s). If you are iterating over N replicas in true round-robin fashion and an (N+1)th replica comes online, you need to manage inserting it into your rotation to maintain true incremental round robin. To do it probabilistically you just update your weights. This is how kube-proxy sets up iptables-based load balancing.
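A sketch of that probabilistic approach (illustrative names, not kube-proxy's actual code): membership changes are just edits to a weight table, with no rotation order to maintain.

```python
import random

def pick(weights):
    """Weighted random selection over the current set of backends."""
    pods = list(weights)
    return random.choices(pods, weights=[weights[p] for p in pods])[0]

weights = {"pod-a": 1, "pod-b": 1}
weights["pod-c"] = 1   # a new replica: just add a weight
del weights["pod-a"]   # a departing replica: just remove its entry
print(pick(weights))   # pod-b or pod-c
```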
Round robin can work very well when the variance of task sizes is small, but degrades quickly when the variance of task sizes becomes large (i.e. there are some tasks that keep servers busy for minutes, and some for milliseconds).
In distributed systems with large numbers of producers (like one of the systems Mihir has worked on, AWS Lambda), round robin essentially degrades into random placement as the number of producers increases.
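A toy simulation (my sketch, not from the comment) of the variance point: with a heavy-tailed task mix, round robin dumps big tasks onto whichever server the rotation happens to hand them, while load-aware pick-two steers around busy servers.

```python
import random

random.seed(0)
# Heavy-tailed mix: mostly tiny tasks, occasionally a 1000x task.
tasks = [random.choice([1, 1, 1, 1000]) for _ in range(2000)]

def round_robin_imbalance(tasks, n):
    load = [0] * n
    for i, t in enumerate(tasks):
        load[i % n] += t          # blind rotation, ignores task size
    return max(load) - min(load)

def pick_two_imbalance(tasks, n):
    load = [0] * n
    for t in tasks:
        a, b = random.sample(range(n), 2)
        load[a if load[a] <= load[b] else b] += t  # less loaded of two
    return max(load) - min(load)

print("round robin imbalance:", round_robin_imbalance(tasks, 10))
print("pick-two imbalance:   ", pick_two_imbalance(tasks, 10))
```

With small, uniform task sizes the two strategies come out nearly identical; the gap opens up as the size variance grows.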
Yeah, order-of-magnitude differences in execution times get messy behind load balancers.
We had an API that was a kitchen sink of different tools, including reporting. Eventually we split reporting off into its own service/LB group, because a call returning 100 bytes of text can get delayed behind a few megabytes of report data being assembled.
If you have multiple load balancers then doing round robin is hard because they need to talk to each other to ensure proper balancing.
The proposed algorithm should work sufficiently well for multiple load balancers without coordination, assuming the number of load balancers is small compared to the number of servers processing the load.