
Comment by tptacek

3 years ago

We operate a large Consul cluster (Consul is great, but we abuse the hell out of it). Frequent leader elections have been responsible for outages. Don't worry about the $200, I'm just fucking with you, but I don't think you're on very firm ground with this line of argument. It's fun to watch though, so I do hope you keep going with it. :)

Hmm. Thank you for the datapoint. It’s why I scaled the bet down to $100 for 200ms.

I think it’s worth uncovering whether a 100ms delay could result in an outage. If I were on call, it’d be hard to sleep knowing that was true.

The root claim is of course that disabling TCP_NODELAY (that is, leaving Nagle's algorithm on) can result in an outage. It still seems $200-unlikely that this could be true. Certainly it might cause performance problems, but the claim was reliability. Outages would put it firmly in the “unreliable” section of the Venn diagram.

My claim about 1-minute leader reelections is admittedly more suspicious. It’s surprising the reelections caused outages. But I suppose if there were a lot of long-running operations that needed a total order, frequent reelections would hose that.

  • In fairness, I don't know if we kept the default. I'm responding to two independent things at this point: first, there are definitely systems where 200ms delays have rippling impacts, and second, leader elections aren't always benign.

    (Consul would, I'm sure, converge eventually regardless of the election frequency, but that doesn't mean everything that relies on Consul will tolerate those delays.)

    I don't have much of a take here, beyond that I don't think you can extrapolate as much from what's on the 6.824 pages as you might have done here. Certainly, in a system where 200ms is the difference between "healthy" and "not healthy" status on a peer relationship, I'd think you'd want Nagle disabled. But I haven't thought carefully about this, or looked that closely at the typical packet flow between Consul nodes. I could be wrong about all of this; more reason not to give me any money.

    Later

    Per the comment upthread, I haven't even bothered to check which parts of this packet flow are even TCP to begin with.
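
For anyone following along who hasn't touched the option itself: "Nagle disabled" means setting TCP_NODELAY on the socket. A minimal sketch in Python, assuming a plain TCP socket; the helper name `disable_nagle` is made up for illustration, and this says nothing about how Consul actually configures its connections.

```python
import socket


def disable_nagle(sock: socket.socket) -> bool:
    """Set TCP_NODELAY on a TCP socket and report whether it took effect.

    With TCP_NODELAY set, the kernel does not hold back small writes
    waiting to coalesce them (Nagle's algorithm is turned off).
    """
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    # Read the option back; a nonzero value means it is enabled.
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0


if __name__ == "__main__":
    # The option can be set before the socket is connected.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print("TCP_NODELAY enabled:", disable_nagle(s))
```

This only applies to TCP sockets, so it is moot for any part of the packet flow that runs over UDP.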