Comment by tptacek

3 years ago

The default last_server_contact timeout for Consul is 200ms. Can I have $200?

Maybe. The sticking point for me is that I’ve implemented enough distributed system protocols to know that even if a server occasionally drops out, the overall reliability of the service isn’t affected. I would be very curious to hear from someone in the field if they feel differently.

It’s easy to assume that a server dropout = less reliable network. But even if a leader election were happening every minute, it seems unlikely to drastically affect any ops in flight.

But sure, if they agree I’ll venmo you $200 too.

  • We operate a large Consul cluster (Consul is great, but we abuse the hell out of it). Frequent leader elections have been responsible for outages. Don't worry about the $200, I'm just fucking with you, but I don't think you're on very firm ground with this line of argument. It's fun to watch though, so I do hope you keep going with it. :)

    • Hmm. Thank you for the data point. It’s why I scaled the bet down to $100 for 200ms.

      I think it’s worth uncovering whether a 100ms delay could result in an outage. If I were on call, it’d be hard to sleep knowing that was true.

      The root claim is of course that disabling NDELAY can result in an outage. It still seems $200-unlikely that this could be true. Certainly it might cause performance problems, but the claim was reliability. Outages would put it firmly in the “unreliable” section of the Venn diagram.

      My claim about one-minute leader re-elections is admittedly more suspicious. It’s surprising the re-elections caused outages. But I suppose if there were a lot of long-running operations that needed a total order, frequent re-elections would hose that.
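
For readers following the root claim, here is a minimal sketch of the setting being argued about. It assumes "NDELAY" in the comment above refers to the per-socket TCP_NODELAY option, which the thread does not spell out. The helper names `set_nodelay` and `nodelay_enabled` are hypothetical and exist only for this illustration.

```python
import socket


def set_nodelay(sock: socket.socket, enabled: bool) -> None:
    """Enable TCP_NODELAY (Nagle's algorithm off) or disable it (Nagle on)."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1 if enabled else 0)


def nodelay_enabled(sock: socket.socket) -> bool:
    """Report whether TCP_NODELAY is currently set on the socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0


if __name__ == "__main__":
    # No connection is needed: the option can be set on a fresh TCP socket.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        set_nodelay(s, True)
        print("TCP_NODELAY on: ", nodelay_enabled(s))
        set_nodelay(s, False)  # the "disabled" case the bet is about
        print("TCP_NODELAY on: ", nodelay_enabled(s))
```

With the option off, the kernel may hold back small writes to coalesce them, which is the source of the added latency under discussion. Whether that latency can reach a given timeout depends on the platform and traffic pattern, and this sketch does not measure it.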