Comment by mjb
10 months ago
> Crash-only software is actually more reliable because it takes into account from the beginning an unavoidable fact of computing - unexpected crashes.
This is a critical point for reliable single-machine systems, and for reliable distributed systems. Distributed systems avoid many classes of crashes through redundancy, allowing the overall system to recover (often with no impact) from the failure or crash of a single node. This provides an additional path to crash recovery: recovering from peers or replicas rather than from local state. In turn, this can simplify the tracking of local state (especially the kind of per-replica WAL or redo log accounting that database systems have to do), leading to improved performance and fewer bugs.
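As a concrete illustration, here's a minimal toy sketch of the two recovery styles (the `Replica` class, the counter-as-state model, and every helper name are hypothetical, not from any particular database):

```python
# Toy, self-contained sketch: state is a counter, the WAL is a list of
# increments, and a "checkpoint" is a (state, lsn) pair.

class Replica:
    def __init__(self):
        self.state = 0   # applied state
        self.lsn = 0     # last applied log sequence number
        self.wal = []    # durable redo log

    def apply(self, record):
        self.state += record
        self.lsn += 1
        self.wal.append(record)

def recover_from_local_wal(checkpoint_state, checkpoint_lsn, wal):
    # Single-machine style: replay every durable record after the checkpoint.
    state = checkpoint_state
    for record in wal[checkpoint_lsn:]:
        state += record
    return state

def recover_from_peer(peers):
    # Distributed style: copy state from the most up-to-date peer,
    # skipping local log accounting entirely.
    donor = max(peers, key=lambda p: p.lsn)
    return donor.state

a, b = Replica(), Replica()
for r in (3, 4, 5):
    a.apply(r)
    b.apply(r)

# Node "a" crashes; it last checkpointed state=3 at lsn=1.
assert recover_from_local_wal(3, 1, a.wal) == a.state == 12
assert recover_from_peer([b]) == a.state
```

In the peer path, the restarting node never touches its own log, which is exactly the accounting the comment says can be simplified away.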
But, as with single-system crashes, distributed systems need to deal with their own reality: correlated failures. These can be caused by correlated infrastructure failures (power, cooling, etc.), by operations (e.g. deploying buggy software), or by the very data they're processing (e.g. a "poison pill" that crashes all the redundant nodes at once). And so, just as crash-only design prepares single-system software for crashes, reliable distributed systems need to be designed to recover from these correlated failure cases.
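A toy sketch of the poison-pill case (the record format and the division bug are invented for illustration): because every replica runs the same deterministic code over the same replicated stream, one bad record can take them all down at once.

```python
def apply_record(state, record):
    # Deterministic bug: blows up when "weight" is zero.
    return state + 100 // record["weight"]

replicas = [0, 0, 0]
stream = [{"weight": 5}, {"weight": 0}]  # the second record is the pill

for record in stream:
    survivors = []
    for state in replicas:
        try:
            survivors.append(apply_record(state, record))
        except ZeroDivisionError:
            pass  # this replica just crashed
    print(f"replicas alive after {record}: {len(survivors)}")
    replicas = survivors
# Prints 3 after the first record, then 0: redundancy didn't help,
# because the failure was perfectly correlated across nodes.
```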
The constants are interestingly different, though. Single-system annual interrupt rates (AIR) are typically in the 1-10% range, while systems spread over multiple datacenters can feasibly see correlated failure rates several orders of magnitude lower. This could argue that having a "bad day" recovery path that's more expensive than regular node recovery is OK. Or, it could argue that the only feasible way of making sure that "bad day" recovery works is to exercise it often (which goes back to the crash-only argument).
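A back-of-envelope calculation makes the gap vivid (the specific rates and fleet size below are assumptions chosen for illustration, not measurements):

```python
node_air = 0.05        # assumed 5% annual interrupt rate per node
fleet_size = 1000      # assumed fleet size
correlated_air = 1e-4  # assumed annual rate of a fleet-wide correlated failure

print(f"node recoveries per year: {node_air * fleet_size:.0f}")        # ~50
print(f"years between correlated failures: {1 / correlated_air:.0f}")  # ~10000

# Ordinary node recovery runs roughly weekly across this fleet, so it
# stays well tested for free; the "bad day" path would fire naturally
# once in millennia, which is why it must be exercised deliberately.
```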