Comment by thraxil

3 years ago

It's not intended to be rigorous. The context here is that Richard I. Cook, one of the main figures in safety and resilience engineering, who published many, many papers on these topics, died recently. The "How Complex Systems Fail" paper is intended to be a bit pithy and light; more an attempt at summarizing years of wisdom. See: https://www.adaptivecapacitylabs.com/blog/2022/09/12/richard...

Well, this sounds wrong to me:

> Catastrophe requires multiple failures – single point failures are not enough

My experience is that a single failure causes a cascade of subsequent failures. This topic is very interesting, but this post is more of a teaser of topics than a real explanation.

  • Places where a single failure in an otherwise perfectly functioning system can cause catastrophic outcomes are relatively easy to identify, relatively easy to argue need to be fixed, and relatively easy to fix. As a result, mature, complex systems have generally developed safety mechanisms for such issues. Once you have done that you need at least two failures (underlying issue + safety, hot+cold, or two interacting systems).

    I would suspect that your experience of single modes of failure being present is one of the following:

    * Immature system (e.g. a startup)
    * One where failure is acceptable and so engineering isn't invested in solving these issues (i.e. the author is talking about disasters that kill people, not a few minutes of ads not getting shown)
    * Extreme organizational dysfunction (talking criminal negligence type stuff)

    • > Once you have done that you need at least two failures (underlying issue + safety, hot+cold, or two interacting systems).

      Ah, I missed the part where he said - except for distributed systems. The thing is, effectively all systems are distributed systems with two or more interacting subsystems.

      And no, I'm not talking about immature systems or ones where failure is acceptable. Queuing issues, for example, are well known to cause cascading effects, and are not trivial to identify or solve.

      Even basic correctness issues can be very difficult to identify if you have a large permutation space and no model checking, and will also cascade.
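
To make the queuing point concrete, here is a toy sketch of how one brief outage can turn into permanent overload once clients retry. It is not from the paper or the thread: the arrival rate, capacity, outage window, and retry rule are all made-up values chosen for illustration, and `simulate` is a hypothetical helper.

```python
def simulate(ticks, arrivals, capacity, outage, retry_threshold=None):
    """Return the backlog after each tick of a single-server queue.

    All parameters are illustrative assumptions, not measurements.
    """
    backlog = 0
    history = []
    for t in range(ticks):
        backlog += arrivals
        # Assumed retry rule: once the backlog is long enough that clients
        # time out, every new request is effectively sent a second time.
        if retry_threshold is not None and backlog > retry_threshold:
            backlog += arrivals
        # The server handles nothing during the outage window.
        served = 0 if t in outage else min(capacity, backlog)
        backlog -= served
        history.append(backlog)
    return history


outage = range(20, 25)  # a single five-tick failure
no_retry = simulate(200, arrivals=9, capacity=10, outage=outage)
with_retry = simulate(200, arrivals=9, capacity=10, outage=outage,
                      retry_threshold=20)

print("final backlog without retries:", no_retry[-1])
print("final backlog with retries:   ", with_retry[-1])
```

Without retries the backlog drains because capacity slightly exceeds the arrival rate. With the assumed retry rule, the effective arrival rate doubles as soon as the backlog crosses the threshold, so the queue keeps growing long after the original failure has cleared.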