Comment by tgsovlerkhgsel

5 hours ago

There is a classic pattern with incident reports that's worth paying attention to: The companies with the best practices will look the worst. Imagine you see two incident reports from different factories:

1. An operator made a mistake and opened the wrong valve during a routine operation. 15,000 liters of hydrochloric acid flooded the factory. As the flood started from the side with the emergency exits, it trapped the workers; 20 people died horribly.

2. At a chemical factory, the automated system that handles tank transfers was out of order. A worker operating a manual override attempted to open the wrong valve. A safety interlock prevented this. Violating procedure, the worker bypassed the safety interlock, causing 15,000 liters of hydrochloric acid to flood the facility. As the main exit was blocked, workers scrambled toward an additional emergency exit hatch that had been installed, but couldn't open it because a pallet of cement had been improperly stored next to it, blocking it. 20 people died horribly.

If you look at them in isolation, the first looks like just one mistake was made, while the second looks like one grossly negligent fuckup after another, making the second report look much worse. What you don't notice at first glance is that the first facility didn't have an automated system that reduced risk for most operations in the first place, didn't have the safety interlock on the valve, and didn't have the extra exit.

So, when you read an incident report, pay attention to this: if it doesn't look like multiple controls failed, often in embarrassing/bad/negligent/criminal ways, that's potentially worse, because the controls that should have existed didn't. "Human error took down production" is worse than "A human making a wrong decision overrode a safety system because they thought they knew better, and the presubmit that was supposed to catch the mistake had a typo". The latter is holes in several layers of Swiss cheese lining up; the former is only having one layer in the first place.

I wish I had more upvotes for you. While the Swiss cheese model is well known on HN by now, your post goes a little deeper and reveals a whole new framework for reading incident reports. Thanks for making me smarter.

The Chernobyl reactor 4 explosion is a bit like this. Safety rules were ignored, again and again and again and again, two safety controls were manually deactivated (both within two hours), then bad luck struck (the control rod channels were deformed by the heat), and then a design flaw (graphite tips on the ends of the control rods) made everything worse, culminating in one of the worst industrial catastrophes of all time.

I don’t understand the point of this theory. Not having safety controls is bad, but having practices so bad that workers violate N layers of safety protocol in the course of operation is also bad. They’re both problems in need of regulation.

  • The failure rate of an individual layer of Swiss cheese is bounded under most circumstances, but not all, so you should probably add more layers when the hazard itself cannot be eliminated (a rough sketch of the arithmetic follows).
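To make the layering point concrete, here's a minimal sketch in Python, with made-up per-layer failure rates. It assumes the layers fail independently, which real incidents (a blocked exit and a bypassed interlock at the same site) often show they don't:

    # Probability that every layer fails at once, assuming independent
    # layers -- a strong assumption that correlated failures (e.g. a
    # culture that tolerates overrides) can easily break.
    def p_all_layers_fail(failure_rates):
        p = 1.0
        for rate in failure_rates:
            p *= rate
        return p

    rates = [0.01, 0.05, 0.1]  # hypothetical per-layer failure rates
    print(f"{p_all_layers_fail(rates[:1]):.0e}")  # one layer:    1e-02
    print(f"{p_all_layers_fail(rates):.0e}")      # three layers: 5e-05

Even with individually leaky layers, the joint failure rate falls fast; the catch is that holes which line up for a common reason are exactly the correlated failures this arithmetic ignores.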