
Comment by toast0

8 hours ago

If process A is waiting for a reply from process B, and process B is waiting for a reply from process A, that is deadlock: neither process can ever continue (unless there's a timeout or one process gets killed). Other processes may progress as long as they don't need a reply from A or B, which is sometimes fine. (Edit: never mind, I forgot gen_server:call/2 has a default 5 second timeout; if the calls keep timing out and retrying continuously you end up in livelock, but a mostly OK system if it works out.)
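Not Erlang, but the shape of that mutual wait can be sketched in a few lines of Python, with threads standing in for processes and an event standing in for the reply (all names here are hypothetical, just for illustration):

```python
import threading

def worker(name, reply_ready, results):
    # Like a gen_server:call with a timeout: block waiting for the peer's
    # reply. Neither peer can ever set the other's event, because each is
    # stuck inside its own call -- so both calls time out.
    got = reply_ready.wait(timeout=0.5)
    results[name] = "replied" if got else "timeout"

a_reply, b_reply = threading.Event(), threading.Event()
results = {}
threads = [
    threading.Thread(target=worker, args=("A", a_reply, results)),
    threading.Thread(target=worker, args=("B", b_reply, results)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # {'A': 'timeout', 'B': 'timeout'}
```

Without the timeout, both waits would block forever; with it, both calls fail and whatever supervision is above them gets to decide what to do next.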

Livelock is something like: you've got 1000 nodes that all want to do X, where X requires an exclusive lock, and the method to get the lock is:

Broadcast request to cluster

If you got the lock on all nodes, proceed

If you did not get the lock on all nodes, release and try again after a timeout

This procedure works in practice when there is low contention. If the cluster is large and many processes contend for the lock, progress is rare. Progress is not impossible, so the system is not deadlocked; but everything takes an inordinate amount of time, mostly spent waiting for locks: the system is livelocked. And in this case, whenever progress does happen, future progress gets easier, because there are fewer contenders left.
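The contention dynamics above can be simulated in a few lines of Python (a toy model, not the actual pg2 code; the 0.5 retry probability and round structure are assumptions for illustration):

```python
import random

def simulate(contenders, rounds, seed=0):
    # Each round, every still-waiting contender independently decides to
    # broadcast a lock request. A request succeeds only if it was the sole
    # request that round (i.e. it got the lock on every node); otherwise
    # everyone releases and retries later -- work is done, nothing progresses.
    rng = random.Random(seed)
    waiting = set(range(contenders))
    wasted_rounds = 0
    for _ in range(rounds):
        if not waiting:
            break
        tries = [c for c in waiting if rng.random() < 0.5]
        if len(tries) == 1:
            waiting.discard(tries[0])   # uncontended: real progress
        elif tries:
            wasted_rounds += 1          # contended: all release and retry

    return len(waiting), wasted_rounds

# With 3 contenders, someone is regularly alone in a round and the queue
# drains. With 1000, essentially every round is a contended retry: the
# cluster is busy the whole time and almost nothing completes -- livelock.
print(simulate(3, 100))
print(simulate(1000, 100))
```

It also shows why progress makes future progress easier: with k waiters, the chance of an uncontended round here is k·0.5^k, which grows as k shrinks toward 1.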

This is a rough description of an actual incident with nodes joining pg2, I think around 2018... The new pg module avoids that lock (and IMHO the lock was not needed anyway; it was there to provide a consistent order in member lists across nodes, but member lists would no longer be consistent after dist disconnects happened and resolved, so why add locks to be consistent only sometimes).

As an Erlang user running what I think were the largest clusters anywhere, we ran into a good number of these kinds of things in OTP. Ericsson built dist for telecom switches with two nodes in a single enclosure in a rack. It works over TCP and they didn't put in explicit limits, so you can run a dist cluster with thousands of nodes in locations across the globe and it mostly works, but there will be some things to debug from time to time. Erlang is fairly easy to debug... All the new nodes have a process waiting to join pg2; what's the pg2 process doing; why doesn't that lock have a consensus-building algorithm, and can we add one? In the meantime, let's kill some nodes so the others can progress, and then we'll run a sequenced start of the rest.