Comment by tgsovlerkhgsel

5 hours ago

There is a classic pattern with incident reports that's worth paying attention to: The companies with the best practices will look the worst. Imagine you see two incident reports from different factories:

1. An operator made a mistake and opened the wrong valve during a routine operation. 15,000 liters of hydrochloric acid flooded the factory. As the flood started from the side with the emergency exits, it trapped the workers; 20 people died horribly.

2. At a chemical factory, the automated system that handles tank transfers was out of order. A worker operating a manual override attempted to open the wrong valve. A safety interlock prevented this. Violating procedure, the worker bypassed the safety interlock, causing 15,000 liters of hydrochloric acid to flood the facility. As the main exit was blocked, workers scrambled toward an additional emergency exit hatch that had been installed, but couldn't open it because a pallet of cement had been improperly stored next to it, blocking it. 20 people died horribly.

If you look at them in isolation, the first looks like just one mistake was made, while the second looks like one grossly negligent fuckup after another, making the second report look much worse. What you don't notice at first glance is that the first facility didn't have an automated system that reduced risk for most operations in the first place, didn't have the safety interlock on the valve, and didn't have the extra exit.

So, when you read an incident report, pay attention to this: if it doesn't look like multiple controls failed, often in embarrassing/bad/negligent/criminal ways, that's potentially worse, because the controls that should have existed didn't. "Human error took down production" is worse than "A human making a wrong decision overrode a safety system because they thought they knew better, and the presubmit that was supposed to catch the mistake had a typo". The latter is holes in several layers of Swiss cheese lining up; the former is only having one layer in the first place.

I wish I had more upvotes for you. While the Swiss cheese model is well known on HN by now, your post goes a little deeper and reveals a whole new framework for reading incident reports. Thanks for making me smarter.

The Chernobyl reactor 4 explosion is a bit like this. Safety rules were ignored, again and again and again and again, two safety controls were manually deactivated (both within two hours), then bad luck struck (the control rod channels were deformed by the heat), and then a design flaw (graphite tips on the ends of the control rods) made everything worse, culminating in one of the worst industrial catastrophes of all time.

I don’t understand the point of this theory. Not having safety controls is bad, but having practices so bad that workers violate N layers of safety protocol in the course of operation is also bad. They’re both problems in need of regulation.

  • The failure rate of an individual layer of Swiss cheese is bounded under most circumstances, but not all, so you should probably add more layers when the hazard itself cannot be eliminated (a rough sketch of the arithmetic follows).
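To make the layering point concrete, here's a minimal sketch in Python, with made-up per-layer failure rates. It assumes the layers fail independently, which real incidents (a blocked exit and a bypassed interlock at the same site) often show they don't:

    # Probability that every layer fails at once, assuming independent
    # layers -- a strong assumption that correlated failures (e.g. a
    # culture that tolerates overrides) can easily break.
    def p_all_layers_fail(failure_rates):
        p = 1.0
        for rate in failure_rates:
            p *= rate
        return p

    rates = [0.01, 0.05, 0.1]  # hypothetical per-layer failure rates
    print(f"{p_all_layers_fail(rates[:1]):.0e}")  # one layer:    1e-02
    print(f"{p_all_layers_fail(rates):.0e}")      # three layers: 5e-05

Even with individually leaky layers, the joint failure rate falls fast; the catch is that holes which line up for a common reason are exactly the correlated failures this arithmetic ignores.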