Comment by phyzome

2 days ago

I found less-than-great results in a simulation where there's a slight persistent difference between two of the options: https://www.brainonfire.net/blog/2019/07/21/load-balancing-b... (as part of a larger study on healthchecks that Don't Suck).

I think the simulation's claim that pick-2 can send 2.5x as much traffic to the most loaded vs. the least loaded server is a bit misleading: that might happen if the load metric is completely random, but the more the metric correlates with actual load, the better. Also, rather than looking at the ratio of most loaded to least loaded, it might be better to look at the ratio of most loaded to average, that is, how much extra work we sent to a poor server. By that measure, pick-2 has an absolute cap of 2x the average load on any one server.
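For what it's worth, here is a rough sketch of where that 2x cap comes from. It assumes things the thread doesn't specify: the two picks are always distinct servers, and the metric gives a strict, noiseless ranking, so the better-ranked server of a pair always wins. The function name `pick2_shares` is made up for illustration. With a noisy metric the comparisons sometimes flip, so real traffic would be less skewed than this.

```python
from fractions import Fraction
from itertools import combinations

def pick2_shares(n):
    """Exact traffic share per server under pick-2.

    Servers are ranked 0 (best metric) to n-1 (worst). Assumes two
    distinct servers are sampled uniformly and the better-ranked one
    always receives the request (no noise in the metric).
    """
    pairs = list(combinations(range(n), 2))
    wins = [0] * n
    for a, b in pairs:
        wins[min(a, b)] += 1  # lower rank means better metric, so it wins
    return [Fraction(w, len(pairs)) for w in wins]

shares = pick2_shares(3)
average = Fraction(1, 3)
print([str(s) for s in shares])  # ['2/3', '1/3', '0']
print(max(shares) / average)     # 2
```

The best-ranked server appears in a fraction 2/n of all pairs and wins every one of them, so its share is 2/n, i.e. twice the average of 1/n, regardless of n.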

  • Real-world case where I've observed these load characteristics: a cluster of three Redis nodes, one of which is the primary and therefore has slightly (but persistently) worse latency. Pick-2 would send significantly less read traffic to that node. Like you say, it's no worse than a 2x difference, but I'd prefer better balancing than that.

    (Pick-2 can also at most halve the traffic to a node with terrible performance, which is not awesome.)

Excellent read. It highlights key aspects like health checks, server restarts, warm up, and load shedding, all of which make load balancing an already hard problem even harder.