Comment by HolyLampshade
1 day ago
A long time ago I had a colleague turn me on to Sidney Dekker’s “Drift Into Failure”, which in many ways covers system design taking the “human” element into account. You could think of it as the “realist’s” approach to system safety.
At the time we operated some industry-specific but national-scale critical systems, and we were discussing how to balance the crucial business importance of agility and rapid release cycles (in our industry) against system fragility and reliability.
Turns out (and I take no credit for the underlying architecture of this specific system, though I’ve been a strong advocate for this model of operating) that if you design systems around humans who can rapidly identify and diagnose what has failed and what the upstream and downstream impacts are, and you make these failures predictable in their scope and nature and the recovery method simple, then with a solid technical operations group you can limit the mean time to resolution of incidents to under 60 seconds without having to invest significant development effort into software that provides automated system recovery.
The issue with both methods (human or technical recovery) is that each depends on maintaining an organizational culture that fosters a deep understanding of how the system fails and what the various predictable upstream and downstream impacts are. The more you permit that culture to decay, the more you increase the likelihood that an outage will go from benign and “normal” to absolutely catastrophic and potentially company-ending.
In my experience, companies that operate under this model eventually sacrifice the flexibility of rapid deployment for an environment where no failure is acceptable, largely because of a lack of appreciation for how much of the system’s design depends on fostering the “appropriate” human element.
(Which leads to further discussion about absolutely critical systems like aviation or nuclear where you absolutely cannot accept catastrophic failure because it results in loss of life)
Extremely long story short, I completely agree. Aviation (more accurately aerospace) disasters, nuclear disasters, medical failures (typically emergency care or surgical), power generation, and the military (especially aircraft carrier flight decks) are all phenomenal areas to look for examples of how systems can be designed to account for where people may fail in the critical path.